Abstract. This paper studies statistical aggregation procedures in the regression setting. A
motivating factor is the existence of many different methods of estimation, leading to possibly
competing estimators.

We consider here three different types of aggregation: model selection (MS) aggregation, 
convex (C) aggregation and linear (L) aggregation. The objective of (MS) is to select the 
optimal single estimator from the list; that of (C) is to select the optimal convex combination 
of the given estimators; and that of (L) is to select the optimal linear combination of the 
given estimators. We are interested in evaluating the rates of convergence of the excess 
risks of the estimators obtained by these procedures. Our approach is motivated by recent 
minimax results in Nemirovski (2000) and Tsybakov (2003). 

There exist competing aggregation procedures achieving optimal convergence separately 
for each one of (MS), (C) and (L) cases. Since the bounds in these results are not directly 
comparable with each other, we suggest an alternative solution. We prove that all the three 
optimal bounds can be nearly achieved via a single "universal" aggregation procedure. We 
propose such a procedure which consists in mixing of the initial estimators with the weights 
obtained by penalized least squares. Two different penalties are considered: one of them is
related to hard thresholding techniques, the second one is a data dependent $L_1$-type penalty.



In this paper we study aggregation procedures and their performance for regression models.
Let $\mathcal{D}_n = \{(X_1,Y_1), \ldots, (X_n,Y_n)\}$ be a sample of independent random pairs $(X_i, Y_i)$ with

(1.1) $$Y_i = f(X_i) + W_i, \qquad i = 1, \ldots, n,$$

where $f : \mathcal{X} \to \mathbb{R}$ is an unknown regression function to be estimated, $\mathcal{X}$ is a Borel subset of
$\mathbb{R}^d$, the $X_i$'s are either random vectors with probability measure $\mu$ supported on $\mathcal{X}$ or fixed
elements in $\mathcal{X}$, and the errors $W_i$ are zero mean random variables, conditionally on the $X_i$'s.

Aggregation of arbitrary estimators in regression models has recently received increasing
attention: Nemirovski (2000), Juditsky and Nemirovski (2000), Yang (2000, 2001, 2004),
Catoni (2001), Györfi et al. (2002), Wegkamp (2003), Tsybakov (2003), Birgé (2003). A
motivating factor is the existence of many different methods of estimation, leading to possibly
competing estimators. Local polynomial kernel smoothing methods and penalized least
squares or likelihood estimators (which include B-splines and wavelet type estimators) are 
two classes of methods that cover the major trends in nonparametric estimation in regression. 
When no method is a clear winner, one may prefer to combine different estimators obtained 
via different methods. Furthermore, within each method one can obtain competing estima- 
tors for different values of the smoothing parameter (the bandwidth in kernel procedures 
and, for the other examples, the calibrating constant in the penalty term or, correspondingly, 
the threshold value). This is usually the case when adaptive estimation is considered. In
all these situations we are faced with a large collection of competing estimators $\hat f_1, \ldots, \hat f_M$.
A natural idea is then to look for a new, improved estimator $\tilde f$ constructed by combining
$\hat f_1, \ldots, \hat f_M$ in a suitable way. Such an estimator $\tilde f$ is called an aggregate and its construction is
called aggregation.

There exist three main aggregation problems: model selection (MS) aggregation, convex 
(C) aggregation and linear (L) aggregation. They are discussed in detail by Nemirovski 
(2000). The objective of (MS) is to select the optimal (in a sense to be defined) single 
estimator from the list; that of (C) is to select the optimal convex combination of the given 
estimators; and that of (L) is to select the optimal linear combination of the given estimators. 

In this paper we consider a more general setup for the (MS), (C) and (L) aggregation 
problems, following Tsybakov (2003). Namely, we do not restrict aggregates to be of the 
form of model selectors, convex or linear combinations of the original estimators. Instead, 
we only require that aggregates should be estimators that mimic the model selection, convex 
or linear oracles. This allows us to construct more powerful aggregates. To give precise 
definitions, denote by $\|g\| = \big(\int g^2(x)\,\mu(dx)\big)^{1/2}$ the norm of a function $g$ in $L_2(\mathbb{R}^d,\mu)$ and set
$\mathsf{f}_\lambda = \sum_{j=1}^M \lambda_j \hat f_j$ for any $\lambda = (\lambda_1, \ldots, \lambda_M) \in \mathbb{R}^M$. The performance of an aggregate $\tilde f$ used to
estimate a function $f \in L_2(\mathbb{R}^d,\mu)$ can be judged against the following mathematical target:

(1.2) $$E_f\|\tilde f - f\|^2 \le \inf_{\lambda \in H^M} E_f\|\mathsf{f}_\lambda - f\|^2 + \Delta_{n,M},$$

where $\Delta_{n,M} \ge 0$ is a remainder term independent of $f$ characterizing the price to pay for
aggregation, and the set $H^M$ is either the whole $\mathbb{R}^M$ (for linear aggregation), or the simplex
$\Lambda^M = \{\lambda = (\lambda_1, \ldots, \lambda_M) \in \mathbb{R}^M : \lambda_j \ge 0, \ \sum_{j=1}^M \lambda_j \le 1\}$ (for convex aggregation), or the set
of vertices of $\Lambda^M$ of the form $(1,0,\ldots,0), \ldots, (0,\ldots,0,1)$ (for model selection aggregation).
Here and later $E_f$ denotes the expectation with respect to the joint distribution of
$(X_1,Y_1), \ldots, (X_n,Y_n)$ under model (1.1).

The random functions $\mathsf{f}_\lambda$ attaining $\inf_{\lambda \in H^M} E_f\|\mathsf{f}_\lambda - f\|^2$ in (1.2) for the three values taken by
$H^M$ are called (L), (C) and (MS) oracles, respectively. Note that these minimizers are not
estimators since they depend on the true $f$.

We say that the aggregate $\tilde f$ mimics the (L), (C) or (MS) oracle if it satisfies (1.2) for
the corresponding set $H^M$, with the minimal possible price for aggregation $\Delta_{n,M}$. Minimal
possible values $\Delta_{n,M}$ for the three problems can be defined via a minimax setting and they
are called optimal rates of aggregation [Tsybakov (2003)] and further denoted by $\psi_{n,M}$. As
shown in Tsybakov (2003), for the Gaussian regression model we have, under mild conditions,

(1.3) $$\psi_{n,M} \asymp \begin{cases} M/n & \text{for (L) aggregation,} \\ M/n & \text{for (C) aggregation, if } M \le \sqrt{n}, \\ \sqrt{\{\log(1 + M/\sqrt{n})\}/n} & \text{for (C) aggregation, if } M > \sqrt{n}, \\ (\log M)/n & \text{for (MS) aggregation.} \end{cases}$$

This implies that linear aggregation has the highest price, (MS) aggregation has the lowest
one, and convex aggregation occupies an intermediate place. The oracle risks on the right in
(1.2) satisfy a reversed inequality:

$$\min_{1 \le j \le M} E_f\|\hat f_j - f\|^2 \ \ge\ \inf_{\lambda \in \Lambda^M} E_f\|\mathsf{f}_\lambda - f\|^2 \ \ge\ \inf_{\lambda \in \mathbb{R}^M} E_f\|\mathsf{f}_\lambda - f\|^2,$$

since the sets over which the infima are taken are nested. Thus, the bound (1.2) for (MS)
aggregation realizes the trade-off between the largest oracle risk and the smallest remainder
term. The bound (1.2) for (L) aggregation realizes the trade-off between the smallest oracle
risk and the largest remainder term. The bound (1.2) for (C) aggregation realizes the trade-off
between an intermediate oracle risk and an intermediate remainder term. If the number of
estimators to be aggregated is small, $M \le \sqrt{n}$, the remainder term in the (C) bound is
identical to that in the (L) bound, but the oracle risk in the (L) bound is always at least as small as
that in the (C) bound. Thus (L) aggregation is preferable to (C) aggregation in this case,
but no comparison can be made with (MS) aggregation. If the number of estimators to be
aggregated is large, $M > \sqrt{n}$, the remainder term in the (L) bound becomes too large, but,
in a strict sense, there is no winner among the three aggregation techniques. The question of
how to choose the best among them remains open.
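
To make the comparison of remainder terms concrete, the rates in (1.3) can be tabulated numerically. The Python sketch below is only an illustration: the function name `optimal_rate` is hypothetical, and multiplicative constants are ignored because (1.3) holds only up to constant factors.

```python
import math

def optimal_rate(kind, M, n):
    """Order of the optimal rate psi_{n,M} in (1.3), constants ignored.

    kind is "L" (linear), "C" (convex) or "MS" (model selection).
    """
    if kind == "L":
        return M / n
    if kind == "C":
        if M <= math.sqrt(n):
            return M / n
        return math.sqrt(math.log(1 + M / math.sqrt(n)) / n)
    if kind == "MS":
        return math.log(M) / n
    raise ValueError("kind must be 'L', 'C' or 'MS'")

# Small dictionary (M <= sqrt(n)): (C) and (L) remainders coincide.
small = {k: optimal_rate(k, 10, 100) for k in ("MS", "C", "L")}
# Large dictionary (M > sqrt(n)): the (L) remainder is much larger.
large = {k: optimal_rate(k, 1000, 100) for k in ("MS", "C", "L")}
```

The remainder terms alone do not settle the choice between procedures, since the corresponding oracle risks are ordered in the opposite direction.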

The ideal oracle inequality (1.2) is available only for some special cases. See Catoni (2001)
for (MS) aggregation in Gaussian regression; Nemirovski (2000), Juditsky and Nemirovski
(2000), Tsybakov (2003) for (C) aggregation with $M > \sqrt{n}$; and Tsybakov (2003) for (L)
aggregation with known marginal measure $\mu$ and for (C) aggregation with $M \le \sqrt{n}$. For
more general situations there exist less precise results of the type

(1.4) $$E_f\|\tilde f - f\|^2 \le C_0 \inf_{\lambda \in H^M} E_f\|\mathsf{f}_\lambda - f\|^2 + \Delta_{n,M},$$

where $C_0 > 1$ is a constant independent of $f$ and $n$, and $\Delta_{n,M}$ is a remainder term, not
necessarily having the same behavior in $n$ and $M$ as the optimal one $\psi_{n,M}$. A disadvantage
of (1.4) compared with (1.2) is that, when the oracle risk $R^* = \inf_{\lambda \in H^M} E_f\|\mathsf{f}_\lambda - f\|^2$ is large, the
additional term $(C_0 - 1)R^*$ on the right-hand side of (1.4) may be much larger than the
remainder term $\Delta_{n,M}$, thus substantially spoiling the convergence properties. This effect is
less pronounced if $C_0 = 1 + \varepsilon$ for some arbitrarily small $\varepsilon > 0$ or for $\varepsilon = \varepsilon_n \to 0$ as $n \to \infty$.

Bounds of the type (1.4) in regression problems have been obtained by many authors
mainly for the model selection case (when $H^M$ is the set of vertices of the simplex $\Lambda^M$), see,
for example, Kneip (1994), Barron et al. (1999), Lugosi and Nobel (1999), Catoni (2001),
Györfi et al. (2002), Baraud (2000, 2002), Bartlett et al. (2002), Wegkamp (2003), Birgé
(2003), Bunea (2004), Bunea and Wegkamp (2004), and the references cited in these works.
Most of the papers on model selection treat particular restricted families of estimators, such
as orthogonal series estimators, spline estimators, etc. An interesting recent development
due to Leung and Barron (2004) covers model selection for all estimators admitting Stein's
unbiased estimation of the risk. There are relatively few results on (MS) aggregation when the
estimators are allowed to be arbitrary, see Catoni (2001), Yang (2000, 2001, 2002), Györfi et
al. (2002), Wegkamp (2003), Birgé (2003), and Tsybakov (2003). Here we make the standard
assumption that $\hat f_1, \ldots, \hat f_M$ are uniformly bounded, but otherwise they can be arbitrary.

Various convex aggregation procedures for nonparametric regression have emerged in the
last decade. They include bootstrap based methods, as suggested by LeBlanc and Tibshirani
(1996) and cross-validation based stacking, as in Wolpert (1992) or Breiman (1996). The
literature on oracle inequalities of the type (1.2) and (1.4) for the (C) aggregation case is not
nearly as large as the one on model selection. Juditsky and Nemirovski (2000), Nemirovski
(2000) propose a stochastic approximation algorithm that achieves the bound (1.2) for (C)
aggregation with optimal rate $\psi_{n,M}$ in the case $M \ge n/\log n$. They also show that the
bound is achieved by usual (non-penalized) least squares convex aggregation. Yang (2000,
2001, 2004) suggests several methods of convex aggregation, in particular ARM (adaptive
regression by mixing). He proves bounds of the form (1.4) with constants $C_0$ that are typically
much larger than 1 and with rates $\Delta_{n,M}$ that can be equal or approximately equal to the
optimal rates $\psi_{n,M}$ when $M$ is a power of $n$. Audibert (2003) establishes (1.2) for a
PAC-Bayesian method of convex aggregation with almost optimal rates, up to a logarithmic factor.
Birgé (2003) suggests a convex aggregation method satisfying (1.4) with a constant $C_0$ that
can be much greater than 1 and with a rate that is optimal for $M > \sqrt{n}$ and suboptimal
for $M \le \sqrt{n}$. On the other hand, Koltchinskii (2004, Section 8) proves (1.2) for a convex
aggregate $\tilde f$ with optimal rate for $M \le \sqrt{n}$ and with almost optimal rate for $M > \sqrt{n}$.

Linear aggregation procedures have received substantially less attention. For regression
models with random design, a procedure achieving the bound (1.2) with optimal rate $\psi_{n,M}$
of (L) aggregation can be found in Tsybakov (2003). For Gaussian white noise models, linear
aggregation has been discussed earlier by Nemirovski (2000).

Aggregation procedures are typically based on sample splitting. The initial sample $\mathcal{D}_n$ is
divided into two independent subsamples $\mathcal{D}^1_m$ and $\mathcal{D}^2_\ell$ of sizes $m$ and $\ell$, respectively, where
$m \gg \ell$ and $m + \ell = n$. The first subsample $\mathcal{D}^1_m$ (called training sample) is used to construct
estimators $\hat f_1, \ldots, \hat f_M$ and the second subsample $\mathcal{D}^2_\ell$ (called learning sample) is used to
aggregate them (i.e., to construct $\tilde f$). In this paper we do not consider sample splitting schemes
but rather deal with an idealized scheme. Following Nemirovski (2000), the first subsample
is fixed and thus instead of estimators $\hat f_1, \ldots, \hat f_M$, we have fixed functions $f_1, \ldots, f_M$. That
is, we focus our attention on learning. Our aim is to find estimators based on the sample
$\mathcal{D}_n$ that would mimic simultaneously the linear, convex and model selection oracles with the
fastest possible rates (or, equivalently, with the smallest possible remainder terms $\Delta_{n,M}$). A
passage to the initial model is straightforward: it is enough to condition on the first subsample,
to use the learning bounds of the type (1.2), (1.4) obtained for the idealized scheme, and
then to take expectations of both sides of the inequalities over the distribution of the whole
sample $\mathcal{D}_n$.
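
The sample splitting scheme described above can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the two toy estimators (a constant fit and a line through the origin), the helper names, and the use of plain empirical risk minimization on the learning sample as a stand-in for an aggregation rule. It is not the procedure studied in this paper, which works with the idealized scheme of fixed functions.

```python
import random

def split_sample(data, m, seed=0):
    """Shuffle the sample and split it into a training part of size m
    and a learning part holding the remaining observations."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    return shuffled[:m], shuffled[m:]

def fit_constant(train):
    """Toy estimator 1: the mean of the responses."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line_through_origin(train):
    """Toy estimator 2: least squares line through the origin."""
    sxx = sum(x * x for x, _ in train)
    slope = sum(x * y for x, y in train) / sxx if sxx > 0 else 0.0
    return lambda x: slope * x

def empirical_risk(g, sample):
    """Average squared prediction error of g on the given sample."""
    return sum((y - g(x)) ** 2 for x, y in sample) / len(sample)

def select_on_learning_sample(estimators, learn):
    """Pick the index of the estimator with the smallest empirical risk
    on the learning sample (a simple model-selection-type rule)."""
    risks = [empirical_risk(g, learn) for g in estimators]
    best = min(range(len(estimators)), key=risks.__getitem__)
    return best, risks

# Noise-free toy data on a line, so the second estimator fits exactly.
data = [(float(x), 2.0 * x) for x in range(1, 21)]
train, learn = split_sample(data, m=15)
estimators = [fit_constant(train), fit_line_through_origin(train)]
best, risks = select_on_learning_sample(estimators, learn)
```

Once the training part is fixed, the fitted estimators are ordinary fixed functions, which is exactly the idealized situation considered in the rest of the paper.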

Another interpretation of aggregation of fixed functions $f_1, \ldots, f_M$ is related to parametric
regression for linear models of dimension $M$, where $M$ can be very large or increasing with
$n$. In fact, assume that both the $X_i$ and the $f_j = \hat f_j$ are fixed (non-random), and consider the linear
regression model with design matrix $(f_j(X_i))_{1 \le i \le n,\ 1 \le j \le M}$ and the empirical counterpart of
the norm $\|\cdot\|$ defined by

$$\|g\|_n^2 = \frac{1}{n} \sum_{i=1}^n g^2(X_i).$$

Then, for $H^M = \Lambda^M$ or $H^M = \mathbb{R}^M$, the value $\inf_{\lambda \in H^M} \|\mathsf{f}_\lambda - f\|_n^2$ represents the best least
squares approximation of an unknown function $f$ at points $X_i$ by the convex or linear span,
respectively, of the columns of the design matrix. Consequently, estimators $\tilde f$ satisfying oracle
inequalities of the form

(1.5) $$E_f\|\tilde f - f\|_n^2 \le C_0 \inf_{\lambda \in H^M} \|\mathsf{f}_\lambda - f\|_n^2 + \Delta_{n,M}$$

mimic the best linear/convex least-squares approximation of $f$ in a parametric regression
framework, provided $C_0 > 1$ is close to 1. In (1.5), $\Delta_{n,M}$ can be interpreted as the price to
pay for the dimension $M$ of the regression model, and we will show that (for an appropriate
choice of the aggregate $\tilde f$) $\Delta_{n,M} \asymp \psi_{n,M}$, where $\psi_{n,M}$ is the optimal rate of aggregation as
defined in (1.3). For the case of linear aggregation, this can be viewed in the spirit of earlier
work on linear models with growing dimension $M$ [Yohai and Maronna (1979), Portnoy
(1984)], but here we obtain non-asymptotic results and our risk is defined in terms of the
regression functions and not in terms of their parameters.

Given the existence of competing aggregation procedures achieving either optimal (MS), or 
(C), or (L) bounds, there is an ongoing discussion as to which procedure is the best one. Since 
this cannot be decided by merely comparing the optimal bounds, we suggest an alternative 
solution. We show that all the three optimal (MS), (C) and (L) bounds can be nearly achieved 
via a single aggregation procedure. Consequently, the smallest of the three will be achieved. 
Our answer will thus meet the desiderata of both model selection and model averaging. 

The procedures that we suggest for aggregation are based on penalized least squares. We
consider two penalties that can be associated with soft thresholding ($L_1$ or Lasso type penalty)
and with hard thresholding, respectively.

In Section 3.1 we show that a hard threshold aggregate satisfies inequalities of the type
(1.5), with $C_0$ arbitrarily close to 1, and with the optimal remainder term $\psi_{n,M}$. We establish
the oracle inequalities for all three sets $H^M$ under consideration, hence showing that the hard
threshold aggregate achieves simultaneously the (MS), (C) and (L) bounds when the empirical
norm $\|\cdot\|_n$ is used to define the risk.

In Section 3.2 we study the performance of a slightly different hard threshold aggregate
under the $L_2(\mathbb{R}^d,\mu)$ norm. We show that this aggregate satisfies simultaneously the oracle
inequalities of the type (1.4) corresponding to the (MS) and (C) bounds, with a remainder
term $\Delta_{n,M}$ that possibly differs from the optimal $\psi_{n,M}$ in a logarithmic factor, and with $C_0$
arbitrarily close to 1.

In Section 4 we study aggregation with the $L_1$ penalty and we obtain (1.5) simultaneously
for the (MS), (C) and (L) cases, with $C_0$ arbitrarily close to 1 and with a remainder term
$\Delta_{n,M}$ that differs from the optimal $\psi_{n,M}$ only in a logarithmic factor.

Finally, we study lower bounds for (MS) and (L) aggregation in the fixed design case in
Section 5, complementing the results obtained for the random design case by Tsybakov (2003).

2. Notation and assumptions 

The following two assumptions on the regression model are supposed to be satisfied
throughout the paper.

Assumption (A1) The random variables $W_i$ are independent and Gaussian $N(0,\sigma^2)$.

Assumption (A2) The functions $f : \mathcal{X} \to \mathbb{R}$ and $f_j : \mathcal{X} \to \mathbb{R}$, $j = 1, \ldots, M$, with $M \ge 2$,
belong to the class $\mathcal{F}$ of uniformly bounded functions defined by

$$\mathcal{F} = \Big\{ g : \mathcal{X} \to \mathbb{R} \ : \ \sup_{x \in \mathcal{X}} |g(x)| \le L \Big\},$$

where $L < \infty$ is a constant that is not necessarily known to the statistician.

The normality assumption (A1) on the distribution of errors is convenient since we need
certain exponential tail bounds in the proofs (see Lemma 3.10 below). For example, bounded
regression can be easily incorporated in this framework using maximal inequalities due to
Talagrand (1994a, b) and Panchenko (2003). More generally, subgaussian errors are allowed at
the cost of increasing technicalities, see Van de Geer (2000). In order to retain a transparent
presentation of both the results and proofs, we confine ourselves to the Gaussian regression
framework.



The functions $f_j$ can be viewed as estimators of $f$ constructed from a training sample (see
the Introduction). Here we consider the ideal situation in which they are fixed, i.e., we
concentrate on learning only. The learning method that we propose is based on aggregating
the $f_j$'s via penalized least squares.

For any $\lambda = (\lambda_1, \ldots, \lambda_M) \in \mathbb{R}^M$, define

$$\mathsf{f}_\lambda(x) = \sum_{j=1}^M \lambda_j f_j(x).$$

For each $\lambda = (\lambda_1, \ldots, \lambda_M) \in \mathbb{R}^M$, let $M(\lambda)$ denote the number of non-zero coordinates of $\lambda$:

$$M(\lambda) = \sum_{j=1}^M I\{\lambda_j \neq 0\} = \mathrm{Card}\, J(\lambda),$$

where $I\{\cdot\}$ denotes the indicator function, and $J(\lambda) = \{j \in \{1, \ldots, M\} : \lambda_j \neq 0\}$. Introduce
the residual sum of squares

$$S(\lambda) = \frac{1}{n} \sum_{i=1}^n \{Y_i - \mathsf{f}_\lambda(X_i)\}^2.$$

Given a penalty term $\mathrm{pen}(\lambda)$, the penalized least squares estimator $\hat\lambda = (\hat\lambda_1, \ldots, \hat\lambda_M)$ is
defined by

(2.1) $$\hat\lambda = \arg\min_{\lambda \in \mathbb{R}^M} \{ S(\lambda) + \mathrm{pen}(\lambda) \},$$

which renders in turn the aggregated estimator

$$\tilde f(x) = \mathsf{f}_{\hat\lambda}(x).$$

Since the vector $\hat\lambda$ can take any values in $\mathbb{R}^M$, the aggregate $\tilde f$ is not a model selector in the
traditional sense, nor is it necessarily a convex combination of the functions $f_j$. Nevertheless,
we will show that it mimics the (MS), (C) and (L) oracles when one of the following two
penalties is used:

(2.2) $$\mathrm{pen}(\lambda) = K_1 \frac{M(\lambda)}{n} \log\Big(1 + \frac{M}{M(\lambda) \vee 1}\Big)$$

or

(2.3) $$\mathrm{pen}(\lambda) = \sum_{j=1}^M r_{n,j} |\lambda_j|,$$

where $K_1 > 0$ is a constant independent of $M$, $n$, and the $r_{n,j}$'s are the data-dependent weights
defined in (4.3).
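
For a small number of functions, the minimization (2.1) with the penalty (2.2) can be carried out by brute force, because the penalty depends on $\lambda$ only through the number of its non-zero coordinates. The Python sketch below is a minimal illustration under stated assumptions: the helper names, the small Gaussian elimination solver, and the value of `K1` are all hypothetical choices (the paper only requires $K_1$ large enough), and the enumeration of all supports is exponential in $M$, so it is not meant as a practical algorithm.

```python
import itertools
import math

def hard_penalty(m, M, n, K1):
    """Penalty (2.2) for a vector with m non-zero coordinates."""
    return K1 * (m / n) * math.log(1.0 + M / max(m, 1))

def solve(A, b):
    """Solve the small square system A z = b by Gaussian elimination
    with partial pivoting; return None if A is numerically singular."""
    k = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-12:
            return None
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, k):
            t = aug[r][c] / aug[c][c]
            for q in range(c, k + 1):
                aug[r][q] -= t * aug[c][q]
    z = [0.0] * k
    for c in reversed(range(k)):
        tail = sum(aug[c][q] * z[q] for q in range(c + 1, k))
        z[c] = (aug[c][k] - tail) / aug[c][c]
    return z

def rss(F, Y, lam):
    """S(lambda) = (1/n) sum_i (Y_i - sum_j lam_j f_j(X_i))^2,
    where F[i][j] holds the value f_j(X_i)."""
    n = len(Y)
    return sum((Y[i] - sum(l * v for l, v in zip(lam, F[i]))) ** 2
               for i in range(n)) / n

def hard_threshold_aggregate(F, Y, K1):
    """Exhaustive search over supports: least squares on each support,
    plus the penalty for the support size."""
    n, M = len(F), len(F[0])
    best_lam = [0.0] * M
    best = rss(F, Y, best_lam)            # empty support, pen = 0
    for m in range(1, M + 1):
        pen = hard_penalty(m, M, n, K1)
        for J in itertools.combinations(range(M), m):
            G = [[sum(F[i][a] * F[i][b] for i in range(n)) for b in J] for a in J]
            g = [sum(F[i][a] * Y[i] for i in range(n)) for a in J]
            coef = solve(G, g)
            if coef is None:              # degenerate support: a smaller one fits as well
                continue
            lam = [0.0] * M
            for a, c in zip(J, coef):
                lam[a] = c
            crit = rss(F, Y, lam) + pen
            if crit < best:
                best, best_lam = crit, lam
    return best_lam, best

# Toy check: Y coincides with the first function, so a single weight suffices.
X = [i / 4 - 1 for i in range(9)]
F = [[x, x * x, 1.0] for x in X]
Y = [x for x in X]
lam_hat, crit = hard_threshold_aggregate(F, Y, K1=0.1)
```

Skipping singular supports loses nothing in this sketch: if the Gram matrix on a support is singular, the same fit is available on a smaller support, which carries a smaller penalty and is enumerated as well.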

We refer to the penalty in (2.2) as hard threshold penalty. This is motivated by the
well known fact that, in the sequence space model (i.e., when the functions $f_1, \ldots, f_M$ are
orthonormal with respect to the scalar product induced by the norm $\|\cdot\|_n$), the penalty
$\mathrm{pen}(\lambda) \sim M(\lambda)$ leads to $\hat\lambda_j$'s that are hard thresholded values of the $Y_j$'s (see, for instance,
Härdle et al. (1998), page 138). Our penalty (2.2) is not exactly of that form, but it differs
from it only in a logarithmic factor.

The penalty (2.3), again in the sequence space model, leads to $\hat\lambda_j$'s that are soft thresholded
values of the $Y_j$'s. We will therefore call it soft threshold penalty or $L_1$-penalty. Penalized
least squares estimators with soft threshold penalty $\mathrm{pen}(\lambda) \sim \sum_{j=1}^M |\lambda_j|$ are closely related to
Lasso-type estimators [Tibshirani (1996), Efron et al. (2004)]. Our results show that, with
$r_{n,j}$'s defined by (4.3), the soft threshold penalty allows near optimal aggregation. The same
is true for the hard threshold penalty (2.2) under somewhat different conditions.

In what follows, we denote by $C, C_1, C_2, \ldots$ finite positive constants, possibly different on
different occasions.
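
The link between the two penalties and thresholding is easiest to see coordinate by coordinate. In the sketch below, the scalar criteria are illustrative normalizations chosen for this example (they are assumptions, not the exact scaling of (2.2) or (2.3)): minimizing $(y-\lambda)^2 + t^2 I\{\lambda \neq 0\}$ over $\lambda$ keeps $y$ when $|y| > t$ and sets it to zero otherwise, while minimizing $(y-\lambda)^2 + 2t|\lambda|$ shrinks $y$ toward zero by $t$.

```python
def hard_threshold(y, t):
    """Minimizer of (y - lam)^2 + t^2 * 1{lam != 0}:
    keeping y costs t^2, dropping it costs y^2."""
    return y if abs(y) > t else 0.0

def soft_threshold(y, t):
    """Minimizer of (y - lam)^2 + 2 * t * |lam|:
    shrink y toward zero by t, and set it to zero if |y| <= t."""
    if y > t:
        return y - t
    if y < -t:
        return y + t
    return 0.0

observations = (3.0, -0.5, -2.0)
hard = [hard_threshold(y, 1.0) for y in observations]
soft = [soft_threshold(y, 1.0) for y in observations]
```

Hard thresholding leaves the surviving coordinates untouched, whereas soft thresholding also biases them toward zero.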



3. Near optimal aggregation with the hard threshold penalty 

3.1. The fixed design case. In this section we show that the penalized least squares
estimator using a penalty of the form (2.2) achieves simultaneously the (MS), (L), and (C)
bounds of the form (1.5) with the correct rates $\Delta_{n,M} \asymp \psi_{n,M}$. Consequently, the smallest
bound is achieved by our aggregate. The results of this section are established for the
empirical loss $\|\tilde f - f\|_n^2$. The next theorem presents an oracle inequality which implies all the
three bounds.



Theorem 3.1. Let $X_i \in \mathcal{X}$, $i = 1, \ldots, n$, be fixed. Let $\tilde f$ be the penalized least squares estimate
defined in (2.1) with penalty (2.2). There exist constants $C_1, C_2 > 0$ such that for all $a > 1$,
for $K_1 = K_0 a \sigma^2$, with $K_0 > 0$ large enough, and for all integers $n \ge 1$ and $M \ge 2$,

(3.1) $$E_f\|\tilde f - f\|_n^2 \le \inf_{\lambda \in \mathbb{R}^M} \Big\{ \frac{a+1}{a-1} \|\mathsf{f}_\lambda - f\|_n^2 + C_1 a \sigma^2 \frac{M(\lambda)}{n} \log\Big(1 + \frac{M}{M(\lambda) \vee 1}\Big) \Big\} + C_2 \frac{a\sigma^2}{n}.$$

This theorem is proved in Section 3.3. The following three corollaries present bounds of
the form (1.5) for (MS), (L), and (C) aggregation, respectively.

Corollary 3.2 (MS). Let the assumptions of Theorem 3.1 be satisfied. Then there exists a
constant $C_3 > 0$ such that for all $\varepsilon > 0$, for $K_1 = K_1(\varepsilon, \sigma^2)$ large enough and for all integers
$n \ge 1$ and $M \ge 2$,

$$E_f\|\tilde f - f\|_n^2 \le (1 + \varepsilon) \inf_{1 \le j \le M} \|f_j - f\|_n^2 + C_3 \sigma^2 (1 + \varepsilon^{-1}) \frac{\log M}{n}.$$

Proof. Since the infimum on the right of (3.1) is taken over all $\lambda \in \mathbb{R}^M$, the bound easily
follows by considering only the subset consisting of the $M$ vertices $(\lambda_1, \ldots, \lambda_M) = (1,0,\ldots,0)$,
$(0,1,0,\ldots,0), \ldots, (0,\ldots,0,1)$ in $\Lambda^M$, and by putting $a = 1 + 2/\varepsilon$. □

Corollary 3.3 (L). Let the assumptions of Theorem 3.1 be satisfied. Then there exists a
constant $C_3 > 0$ such that for all $\varepsilon > 0$, for $K_1 = K_1(\varepsilon, \sigma^2)$ large enough and for all integers
$n \ge 1$ and $M \ge 2$,

$$E_f\|\tilde f - f\|_n^2 \le (1 + \varepsilon) \inf_{\lambda \in \mathbb{R}^M} \|\mathsf{f}_\lambda - f\|_n^2 + C_3 \sigma^2 (1 + \varepsilon^{-1}) \frac{M}{n}.$$

Proof. Since $x \mapsto x \log(1 + M/x)$ is increasing for $1 \le x \le M$,

$$\sup_{\lambda \in \mathbb{R}^M} \frac{M(\lambda)}{n} \log\Big(1 + \frac{M}{M(\lambda) \vee 1}\Big) = \frac{M}{n} \log 2.$$

The result then follows from (3.1) with $a = 1 + 2/\varepsilon$. □
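
The role of the choice $a = 1 + 2/\varepsilon$ in the two proofs above can be checked directly. The short computation below is a sketch of the arithmetic only; it uses the leading factor $(a+1)/(a-1)$ and the logarithmic term appearing in (3.1), and the bound $a \le 2(1+\varepsilon^{-1})$ is a convenient estimate introduced here for illustration.

```latex
% Leading constant: with a = 1 + 2/\varepsilon,
\frac{a+1}{a-1} = \frac{2 + 2/\varepsilon}{2/\varepsilon} = 1 + \varepsilon,
\qquad
a = 1 + \frac{2}{\varepsilon} \le 2\,(1 + \varepsilon^{-1}).

% (MS): at a vertex of \Lambda^M we have M(\lambda) = 1, and for M \ge 2
\frac{M(\lambda)}{n}\log\Bigl(1 + \frac{M}{M(\lambda)\vee 1}\Bigr)
  = \frac{\log(1+M)}{n} \le \frac{2\log M}{n},
\quad\text{since } 1 + M \le M^2 \text{ for } M \ge 2.

% (L): x \mapsto x\log(1+M/x) is increasing on [1, M], so the maximum is at x = M
\sup_{1 \le x \le M} \frac{x}{n}\log\Bigl(1 + \frac{M}{x}\Bigr) = \frac{M}{n}\log 2.
```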



Corollary 3.4 (C). Let the assumptions of Theorem 3.1 be satisfied. Then there exists a
constant $C_3 > 0$ depending on $L$ and $\sigma^2$ such that for all $\varepsilon > 0$, for $K_1 = K_1(\varepsilon, \sigma^2)$ large
enough and for all integers $n \ge 1$ and $M \ge 2$,

$$E_f\|\tilde f - f\|_n^2 \le (1 + \varepsilon) \inf_{\lambda \in \Lambda^M} \|\mathsf{f}_\lambda - f\|_n^2 + C_3 (1 + \varepsilon + \varepsilon^{-1})\, \psi_n^C(M),$$

where

$$\psi_n^C(M) = \begin{cases} M/n & \text{if } M \le \sqrt{n}, \\ \sqrt{\{\log(1 + M/\sqrt{n})\}/n} & \text{if } M > \sqrt{n}. \end{cases}$$

Proof. For $M \le \sqrt{n}$ the result follows from Corollary 3.3. Assume now that $M > \sqrt{n}$ and
let $m$ be the integer part of

$$x_{n,M} = \sqrt{\frac{n \log 2}{\log(1 + M/\sqrt{n})}}.$$

Clearly, $0 \le m \le x_{n,M} \le M$. First, consider the case $m \ge 1$. Denote by $\mathcal{C}$ the set of functions
$h$ of the form

$$h(x) = \frac{1}{m} \sum_{j=1}^M k_j f_j(x), \qquad k_j \in \{0, 1, \ldots, m\}, \qquad \sum_{j=1}^M k_j \le m.$$

The following approximation result can be obtained by the "Maurey argument" (see, for
example, Barron (1993), Lemma 1, or Nemirovski (2000), pages 192, 193):

(3.2) $$\min_{g \in \mathcal{C}} \|g - f\|_n^2 \le \min_{\lambda \in \Lambda^M} \|\mathsf{f}_\lambda - f\|_n^2 + \frac{L^2}{m}.$$

For completeness, we give the proof of (3.2) in the Appendix. Since $M(\lambda) \le m \le x_{n,M}$ for the
vectors $\lambda$ corresponding to $g \in \mathcal{C}$, and since $x \mapsto x \log(1 + M/x)$ is increasing for $1 \le x \le M$,
we get from (3.1):

$$E_f\|\tilde f - f\|_n^2 \le \inf_{g \in \mathcal{C}} \Big\{ \frac{a+1}{a-1} \|g - f\|_n^2 + C_1 a \sigma^2 \frac{m}{n} \log\Big(1 + \frac{M}{m}\Big) \Big\} + C_2 \frac{a\sigma^2}{n}.$$

Using this inequality, (3.2) and the fact that $m = \lfloor x_{n,M} \rfloor \ge x_{n,M}/2$ for $x_{n,M} \ge 1$, we obtain

(3.3) $$E_f\|\tilde f - f\|_n^2 \le \frac{a+1}{a-1} \inf_{\lambda \in \Lambda^M} \|\mathsf{f}_\lambda - f\|_n^2 + \Big(\frac{a+1}{a-1}\Big) \frac{2L^2}{x_{n,M}} + C_1 a \sigma^2 \frac{x_{n,M}}{n} \log\Big(1 + \frac{M}{x_{n,M}}\Big) + C_2 \frac{a\sigma^2}{n}.$$

We use this bound for all choices of $\lambda \in \Lambda^M$ with $m \ge M(\lambda) \neq 0$. For $m = 0$, we only need
to consider the singular case $\lambda = 0$ as $M(\lambda) = 0$ if and only if $\lambda = 0$. Note that for $m = 0$,
we have $1/x_{n,M} > 1$, and we use the trivial upper bound

$$\frac{a+1}{a-1} \|f\|_n^2 \le \Big(\frac{a+1}{a-1}\Big) L^2 \Big(\frac{\log(1 + M/\sqrt{n})}{n \log 2}\Big)^{1/2}$$

for the right-hand side of (3.1).

To complete the proof of the Corollary, it remains to put $a = 1 + 2/\varepsilon$ and to note that

$$\log\Big(1 + \frac{M}{x_{n,M}}\Big) \le 2 \log\Big(1 + \frac{M}{\sqrt{n}}\Big)$$

in view of the elementary inequality $\log\big(1 + (\log 2)^{-1/2}\, y \sqrt{\log(1 + y)}\big) \le 2 \log(1 + y)$, for all
$y \ge 1$. □
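
The "Maurey argument" behind (3.2) is a sampling argument, and its construction can be illustrated in code. In the hedged Python sketch below (an illustration, not the proof given in the Appendix), $m$ indices are drawn independently, index $j$ with probability $\lambda_j$ and "no function" with the remaining probability $1 - \sum_j \lambda_j$; the empirical frequencies $k_j/m$ are then the weights of an element of the set $\mathcal{C}$. The function name `maurey_sample` and the seed are assumptions introduced for this example.

```python
import random

def maurey_sample(lam, m, seed=0):
    """Draw m indices i.i.d. (index j with probability lam[j], nothing with
    the leftover probability) and return the frequencies k_j / m.

    lam must lie in the simplex: lam[j] >= 0 and sum(lam) <= 1.
    """
    rng = random.Random(seed)
    counts = [0] * len(lam)
    for _ in range(m):
        u, acc = rng.random(), 0.0
        for j, weight in enumerate(lam):
            acc += weight
            if u < acc:
                counts[j] += 1
                break
        # if no index was hit, the draw corresponds to the zero function
    return [k / m for k in counts]

lam = [0.5, 0.3, 0.1, 0.0]
weights = maurey_sample(lam, m=5, seed=1)
```

By construction the returned vector has at most $m$ non-zero coordinates, which is what makes the penalty term in (3.1) small for elements of $\mathcal{C}$. Averaging the squared empirical distance between the sampled combination and $\mathsf{f}_\lambda$ over many seeds and comparing it with $L^2/m$ gives a quick numerical check of the approximation rate.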



We remark now that the aggregate considered in Theorem 3.1 also satisfies bounds "in
probability" that are similar in spirit to (3.1) and its corollaries.

Theorem 3.5. Let $X_i \in \mathcal{X}$, $i = 1, \ldots, n$, be fixed. Let $\tilde f$ be the penalized least squares
estimate defined in (2.1) with penalty (2.2). There exist constants $C_1, L_1, L_2 > 0$ such that
for all $a > 1$, for $K_1 = K_0 a \sigma^2$, with $K_0 > 0$ large enough, and for all integers $n \ge 1$, $M \ge 2$
and any $\delta > 0$,

(3.4) $$P\Big\{ \|\tilde f - f\|_n^2 \ge \inf_{\lambda \in \mathbb{R}^M} \Big\{ \frac{a+1}{a-1} \|\mathsf{f}_\lambda - f\|_n^2 + C_1 a \sigma^2 \frac{M(\lambda)}{n} \log\Big(1 + \frac{M}{M(\lambda) \vee 1}\Big) \Big\} + \delta \Big\} \le L_1 \exp\Big(-L_2 \frac{n\delta}{\sigma^2}\Big).$$

As in the case of Theorem 3.1, we can consequently obtain the analogues of Corollaries
3.2 - 3.4 by replacing the infimum in (3.4) by its particular form for the cases (MS), (L) and
(C), respectively. We do not include each case, for brevity.

3.2. The random design case. In this subsection we show that an oracle inequality similar
to (3.1) continues to hold if the empirical norm $\|\cdot\|_n$ is replaced by the $L_2(\mathbb{R}^d,\mu)$ norm $\|\cdot\|$.
This result is more difficult to obtain and we do not achieve exactly the same bounds.

We need to restrict minimization of the penalized sum of squares to a bounded set in $\mathbb{R}^M$.
Define, for any $T > 0$,

$$\Lambda^{M,T} = \Big\{ \lambda \in \mathbb{R}^M : \sum_{j=1}^M |\lambda_j| \le T \Big\}.$$

The penalty term needs to be chosen slightly larger than before:

(3.5) $$\mathrm{pen}(\lambda) = K_1 \frac{M(\lambda)}{n} \log\Big(1 + \frac{M \vee n}{M(\lambda) \vee 1}\Big),$$

for some large $K_1 > 0$. We note that here $K_1$ is not necessarily the same as in (2.2); we just
use the same notation for factors in the penalty term.

Theorem 3.6. Assume that $X_1, \ldots, X_n$ are independent random variables with common
probability measure $\mu$. Let $T < \infty$ be fixed, and set

$$B = L^2 (T + 1)^2.$$

Let $\tilde f = \mathsf{f}_{\hat\lambda}$ where

$$\hat\lambda = \arg\min_{\lambda \in \Lambda^{M,T}} \{ S(\lambda) + \mathrm{pen}(\lambda) \}$$

with the penalty given in (3.5). Then there exist constants $C_1, C_2 > 0$ such that for all $a > 1$,
for $K_1 = K_1(a, B, \sigma^2)$ large enough, and for all integers $n \ge 1$ and $M \ge 2$,

(3.6) E^lJ-ZH 2 

< mf <^ -||f\ — /|| + C\aa — log 1 + -77777-— \ + Ci- 



agA m , t [a-l" " n V M W V 1/ J n 

Because of the slight increase in the penalty, the remainder term in (3.6) is somewhat
larger than the one given in (3.1): we now have $M \vee n$ in place of $M$ under the logarithm.

As corollaries, one obtains the following (MS) and (C) bounds for the estimator $\tilde f$ defined
in Theorem 3.6.

Corollary 3.7 (MS). Let the assumptions of Theorem 3.6 be satisfied and $T \ge 1$. Then
there exists a constant $C > 0$ such that for all $\varepsilon > 0$, for $K_1 = K_1(\varepsilon, \sigma^2)$ large enough and
for all integers $n \ge 1$ and $M \ge 2$,

$$E_f\|\tilde f - f\|^2 \le (1 + \varepsilon) \inf_{1 \le j \le M} \|f_j - f\|^2 + C \sigma^2 (1 + \varepsilon^{-1}) \frac{\log(M \vee n)}{n}.$$

Corollary 3.8 (C). Let the assumptions of Theorem 3.6 be satisfied and $T \ge 1$. Then there
exists a constant $C > 0$ depending on $L$ and $\sigma^2$ such that for all $\varepsilon > 0$, for $K_1 = K_1(\varepsilon, \sigma^2)$
large enough and for all integers $n \ge 1$ and $M \ge 2$,

$$E_f\|\tilde f - f\|^2 \le (1 + \varepsilon) \inf_{\lambda \in \Lambda^M} \|\mathsf{f}_\lambda - f\|^2 + C (1 + \varepsilon + \varepsilon^{-1})\, \tilde\psi_n^C(M),$$

where

$$\tilde\psi_n^C(M) = \begin{cases} (M \log n)/n & \text{if } M \le \sqrt{n}, \\ \sqrt{\{\log(1 + (M \vee n)/\sqrt{n})\}/n} & \text{if } M > \sqrt{n}. \end{cases}$$

As compared to Corollaries 3.2 and 3.4, these results present slightly different rates of
convergence: here the factor $\log M$ is replaced by $\log n$ for values $M < n$. The proofs are
omitted since Corollaries 3.7 and 3.8 readily follow from the oracle inequality (3.6) and the
fact that $\Lambda^M \subset \Lambda^{M,T}$ for $T \ge 1$ via an argument similar to the proofs of Corollaries 3.2 and 3.4.


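3.3. Proof of Theorem 3.1. Let $\lambda$ be a fixed, but arbitrary point in $\mathbb{R}^M$. Define for all
$1 \le m \le M$,

$$A_m(\lambda) = \{ \bar\lambda = \lambda' - \lambda \in \mathbb{R}^M : M(\lambda') = m \}.$$

Let $J_k$, $k = 1, \ldots, \binom{M}{m}$, be all the subsets of $\{1, \ldots, M\}$ of cardinality $m$. Define

$$A_{m,k}(\lambda) = \{ \bar\lambda = (\bar\lambda_1, \ldots, \bar\lambda_M) \in A_m(\lambda) : \lambda'_j \neq 0 \Leftrightarrow j \in J_k \},$$

where $\lambda'_j = \lambda_j + \bar\lambda_j$. The collection $\{A_{m,k}(\lambda) : 1 \le k \le \binom{M}{m}\}$ forms a partition of the set
$A_m(\lambda)$. Furthermore, define affine subspaces of $\mathbb{R}^n$ of the form

$$B_{m,k}(\lambda) = \{ (\mathsf{f}_{\bar\lambda}(X_1), \ldots, \mathsf{f}_{\bar\lambda}(X_n)) \in \mathbb{R}^n : \bar\lambda \in A_{m,k}(\lambda) \}$$

and let $\Pi^\lambda_{m,k} W$ denote the projection of the vector $W = (W_1, \ldots, W_n)$ onto $B_{m,k}(\lambda)$. Clearly,
$\dim(B_{m,k}(\lambda)) \le m$. Finally, we define for each $\gamma \in \mathbb{R}^M$,

$$V_n(\gamma) \stackrel{\mathrm{def}}{=} \frac{\frac{1}{n} \sum_{i=1}^n W_i\, \mathsf{f}_\gamma(X_i)}{\|\mathsf{f}_\gamma\|_n} \quad \text{if } \|\mathsf{f}_\gamma\|_n \neq 0,$$

and $V_n(\gamma) = 0$, otherwise.

Lemma 3.9. For all $a > 1$, $b > 0$ and $\lambda \in \mathbb{R}^M$, we have

$$\|\tilde f - f\|_n^2 \le \frac{b+1}{b}\,\frac{a}{a-1} \|\mathsf{f}_\lambda - f\|_n^2 + \frac{a}{a-1} K_1 \frac{M(\lambda)}{n} \log\Big(1 + \frac{M}{M(\lambda) \vee 1}\Big)$$
$$\qquad + \frac{a}{a-1} \max_{1 \le m \le M} \max_{1 \le k \le \binom{M}{m}} \Big\{ (a+b) \|\Pi^\lambda_{m,k} W\|_n^2 - K_1 \frac{m}{n} \log\Big(1 + \frac{M}{m \vee 1}\Big) \Big\} + \frac{a(a+b)}{a-1} V_n^2(\lambda).$$

Proof. By the definition of $\hat\lambda$, for any $\lambda \in \mathbb{R}^M$,

$$S(\hat\lambda) + \mathrm{pen}(\hat\lambda) \le S(\lambda) + \mathrm{pen}(\lambda).$$

Rewriting this inequality yields

$$\|\tilde f - f\|_n^2 \le \|\mathsf{f}_\lambda - f\|_n^2 + 2 \langle W, \tilde f - \mathsf{f}_\lambda \rangle_n + \mathrm{pen}(\lambda) - \mathrm{pen}(\hat\lambda),$$

where $\langle \cdot, \cdot \rangle_n$ denotes the scalar product associated with the norm $\|\cdot\|_n$. Since $\|\tilde f - \mathsf{f}_\lambda\|_n = 0$
implies that $\langle W, \tilde f - \mathsf{f}_\lambda \rangle_n = 0$, we find

$$\|\tilde f - f\|_n^2 \le \|\mathsf{f}_\lambda - f\|_n^2 + 2 V_n(\hat\lambda - \lambda) \|\tilde f - \mathsf{f}_\lambda\|_n + \mathrm{pen}(\lambda) - \mathrm{pen}(\hat\lambda)$$
$$\le \|\mathsf{f}_\lambda - f\|_n^2 + 2 V_n(\hat\lambda - \lambda) \|\tilde f - f\|_n + 2 V_n(\hat\lambda - \lambda) \|\mathsf{f}_\lambda - f\|_n + \mathrm{pen}(\lambda) - \mathrm{pen}(\hat\lambda)$$
$$\le \Big(1 + \frac{1}{b}\Big) \|\mathsf{f}_\lambda - f\|_n^2 + a V_n^2(\hat\lambda - \lambda) + \frac{1}{a} \|\tilde f - f\|_n^2 + b V_n^2(\hat\lambda - \lambda) + \mathrm{pen}(\lambda) - \mathrm{pen}(\hat\lambda),$$

where $a, b > 0$ are arbitrary, and we used the inequality $2xy \le c x^2 + y^2/c$ valid for all $x, y \in \mathbb{R}$
and $c > 0$. Consequently, for any $a > 1$, $b > 0$, we find

$$\|\tilde f - f\|_n^2 \le \frac{b+1}{b}\,\frac{a}{a-1} \|\mathsf{f}_\lambda - f\|_n^2 + \frac{a}{a-1} \mathrm{pen}(\lambda) + \frac{a}{a-1} \big\{ (a+b) V_n^2(\hat\lambda - \lambda) - \mathrm{pen}(\hat\lambda) \big\}.$$

Next, since $\mathbb{R}^M = \bigcup_{m=0}^M \bigcup_{k=1}^{\binom{M}{m}} A_{m,k}(\lambda)$, we find that

$$(a+b) V_n^2(\hat\lambda - \lambda) - \mathrm{pen}(\hat\lambda) = (a+b) V_n^2(\hat\lambda - \lambda) - \mathrm{pen}(\hat\lambda - \lambda + \lambda)$$
$$\le \max_{0 \le m \le M} \max_{1 \le k \le \binom{M}{m}} \max_{\bar\lambda \in A_{m,k}(\lambda)} \big\{ (a+b) V_n^2(\bar\lambda) - \mathrm{pen}(\bar\lambda + \lambda) \big\}.$$

It remains to bound the term on the right in view of the last two displays. The case $m = 0$
is degenerate as $A_0(\lambda) = A_{0,1}(\lambda) = \{-\lambda\}$. Note that for $\bar\lambda = -\lambda$,

$$(a+b) V_n^2(\bar\lambda) - \mathrm{pen}(\bar\lambda + \lambda) = (a+b) V_n^2(\lambda),$$

since $\mathrm{pen}(0) = 0$ and $\mathsf{f}_{-\lambda} = -\mathsf{f}_\lambda$. For each $m \ge 1$, we have

$$\max_{1 \le k \le \binom{M}{m}} \max_{\bar\lambda \in A_{m,k}(\lambda)} \big\{ (a+b) V_n^2(\bar\lambda) - \mathrm{pen}(\bar\lambda + \lambda) \big\}
\le \max_{1 \le k \le \binom{M}{m}} \max_{\bar\lambda \in A_{m,k}(\lambda)} \big\{ (a+b) \|\Pi^\lambda_{m,k} W\|_n^2 - \mathrm{pen}(\bar\lambda + \lambda) \big\}$$

by the orthogonality of $W - \Pi^\lambda_{m,k} W$ and $(\mathsf{f}_{\bar\lambda}(X_1), \ldots, \mathsf{f}_{\bar\lambda}(X_n))$ for all $\bar\lambda \in A_{m,k}(\lambda)$,

$$= \max_{1 \le k \le \binom{M}{m}} \Big\{ (a+b) \|\Pi^\lambda_{m,k} W\|_n^2 - K_1 \frac{m}{n} \log\Big(1 + \frac{M}{m \vee 1}\Big) \Big\}$$

in view of (2.2) and since $M(\bar\lambda + \lambda) = m$ for all $\bar\lambda \in A_{m,k}(\lambda)$.
This concludes the proof of the lemma. □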



From now on, we take $a = b > 1$. Since, by Assumption (A1), the errors $W_i$ are normal
$N(0, \sigma^2)$, the standardized statistic $n \sigma^{-2} \|\Pi^\lambda_{m,k} W\|_n^2$ has a $\chi^2$ distribution with $m$ degrees of
freedom for all $1 \le k \le \binom{M}{m}$. The following tail bound for such a statistic will be useful.

Lemma 3.10. Let $Z_d$ denote a random variable having the $\chi^2$ distribution with $d$ degrees of
freedom. Then for all $x > 0$,

(3.7) $$P\{ Z_d - d \ge x \sqrt{2d} \} \le \exp\Big( - \frac{x^2}{2 (1 + x \sqrt{2/d})} \Big).$$

Proof. See Cavalier et al. (2002), equation (27) at page 857. □
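
The tail bound (3.7) can be sanity-checked numerically. The Python sketch below compares the right-hand side of (3.7) with the exact tail probability, using the closed form $P(Z_d \ge t) = e^{-t/2} \sum_{i=0}^{d/2-1} (t/2)^i / i!$ that is available when $d$ is even. Restricting to even $d$ is a simplification made here to stay within the standard library; the function names are hypothetical.

```python
import math

def chi2_tail_even(d, t):
    """Exact P(Z_d >= t) for a chi-square variable with an even number d
    of degrees of freedom, via the Poisson-sum closed form."""
    if d <= 0 or d % 2 != 0:
        raise ValueError("this closed form requires an even d > 0")
    half = t / 2.0
    term, total = 1.0, 1.0
    for i in range(1, d // 2):
        term *= half / i          # term equals (t/2)^i / i!
        total += term
    return math.exp(-half) * total

def lemma_bound(d, x):
    """Right-hand side of (3.7)."""
    return math.exp(-x * x / (2.0 * (1.0 + x * math.sqrt(2.0 / d))))

def compare(d, x):
    """Return (exact tail, bound) at the threshold d + x * sqrt(2 d)."""
    t = d + x * math.sqrt(2.0 * d)
    return chi2_tail_even(d, t), lemma_bound(d, x)

pairs = {(d, x): compare(d, x) for d in (2, 4, 10, 50) for x in (0.1, 1.0, 3.0, 10.0)}
```

In each pair the exact probability should not exceed the bound; the gap is large for small $x$, where the bound is close to 1.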

Lemma 3.11. There exists C > such that, for any integer n > 1 and any a > 1, K\ = K$ao~ 2 
with Kq > large enough, 



(3.8) E f max max ^allni k W\\l - — mlog (\ + — — H < C— . 
v ; T l<m<M x^fM) \ " m ' k " n n V m\l\))~ n 

(3.9) E^(A) < 
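Proof. Inequality (3.9) is trivial and we will prove only (3.8). For any $\delta > 0$ we have

\begin{align*}
P_\delta &\stackrel{\mathrm{def}}{=} \mathbb{P}\left\{\max_{1\le m\le M}\ \max_{1\le k\le \binom{M}{m}} \left[2a\|\Pi_{m,k}W\|_n^2 - \frac{K_1}{n}\, m\log\left(1+\frac{M}{m\vee 1}\right)\right] \ge \delta\right\} \\
&\le \sum_{m=1}^{M}\sum_{k=1}^{\binom{M}{m}} \mathbb{P}\left\{2a\|\Pi_{m,k}W\|_n^2 - \frac{K_1}{n}\, m\log\left(1+\frac{M}{m\vee 1}\right) \ge \delta\right\} \\
&= \sum_{m=1}^{M}\binom{M}{m}\, \mathbb{P}\left\{Z_m \ge \frac{K_1}{2a\sigma^2}\, m\log\left(1+\frac{M}{m}\right) + \frac{n\delta}{2a\sigma^2}\right\} \\
&= \sum_{m=1}^{M}\binom{M}{m}\, \mathbb{P}\left\{\frac{Z_m - m}{\sqrt{2m}} \ge \frac{K_1}{2a\sigma^2}\frac{\sqrt{m}}{\sqrt{2}}\log\left(1+\frac{M}{m}\right) - \frac{\sqrt{m}}{\sqrt{2}} + \frac{n\delta}{2a\sigma^2\sqrt{2m}}\right\} \\
&\le \sum_{m=1}^{M}\binom{M}{m}\exp\left(-C_0\left[\frac{K_1 m}{a\sigma^2}\log\left(1+\frac{M}{m}\right) + \frac{n\delta}{a\sigma^2}\right]\right)
\end{align*}

by Lemma 3.10 for $K_1 = K_0 a\sigma^2$ with $K_0 > 0$ large enough and some universal constant $C_0 > 0$. Using the crude bound $\binom{M}{m} \le (eM/m)^m$ [see, for example, Devroye et al. (1996), page 218], the inequality $1 + \log x \le 2\log(1 + x)$, $\forall\, x \ge 1$, and taking $K_0$ such that $C_0K_0 > 4$ we get

\[
\sum_{m=1}^{M}\binom{M}{m}\exp\left(-C_0K_0\, m\log\left(1+\frac{M}{m}\right)\right) \le \sum_{m=1}^{M}\exp\left(-m\log\left(1+\frac{M}{m}\right)\right) \le \sum_{m=1}^{\infty}\exp(-m\log 2) < \infty.
\]

These inequalities finally yield the bound on the tail probabilities

\[
P_\delta \le C_3\exp(-C_4\, n\delta) \tag{3.10}
\]

for some constants $C_3, C_4 > 0$, which easily implies the bound (3.8) on the expected value. □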
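
The two elementary facts used in this proof, the crude bound $\binom{M}{m} \le (eM/m)^m$ and the summability of the resulting series when $C_0K_0 \ge 4$, can be checked numerically. The sketch below works on the log scale; the helper names and the value `c0k0 = 4.0` are illustrative choices of ours, not constants from the paper.

```python
import math

def log_binom(M, m):
    """log of the binomial coefficient C(M, m)."""
    return math.log(math.comb(M, m))

def crude_log_bound(M, m):
    """log of (e*M/m)^m, the crude upper bound on C(M, m)."""
    return m * (1.0 + math.log(M / m))

def log_summand(M, m, c0k0):
    """log of C(M, m) * exp(-c0k0 * m * log(1 + M/m))."""
    return log_binom(M, m) - c0k0 * m * math.log(1.0 + M / m)

M = 30
worst_gap = max(log_binom(M, m) - crude_log_bound(M, m) for m in range(1, M + 1))
series = sum(math.exp(log_summand(M, m, 4.0)) for m in range(1, M + 1))
print("max of log C(M,m) - log (eM/m)^m:", worst_gap)   # should be <= 0
print("sum of the series for M = 30:", series)          # should be below sum of 2^-m = 1
```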



Proof of Theorem 3.1. Theorem 3.1 follows directly from Lemmas 3.9 and 3.11. □



Proof of Theorem 3.5. First notice that, by Lemma 3.9 for $a = b > 1$ there exists $C_1 > 0$ such that



/lln> M 



a + l, 



AeR M I a — 1 



fA-/l£+ cW 



.M(A) 



log 1 + 



M 



< 



max max \ 2a 1 1 n^,, fcW||^ 



a — 1 l<m<M i<k<( M ) 



Kim 



log 1 + 



M(A) V 1 
M 



+ <5 



mVl 



> <J/2 



2a 2 
a - 1 



V*(\)>5/2 . 
Next, the rescaled variable $n\sigma^{-2}V_n(\lambda)$ has a $\chi^2$ distribution with 1 degree of freedom. Combining the exponential bound for tail probabilities of $\chi^2$ random variables (Lemma 3.10) and the exponential bound (3.10) completes the proof. □



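3.4. Proof of Theorem 3.6. By the same reasoning as in the proof of Theorem 3.1,

\begin{align*}
\|\tilde f - f\|^2 &= (1+a)\|\tilde f - f\|_n^2 + \left\{\|\tilde f - f\|^2 - (1+a)\|\tilde f - f\|_n^2\right\} \\
&\le (1+a)\left\{\|f_\lambda - f\|_n^2 + 2\langle W, \tilde f - f_\lambda\rangle_n + \operatorname{pen}(\lambda) - \operatorname{pen}(\hat\lambda)\right\} + \left\{\|\tilde f - f\|^2 - (1+a)\|\tilde f - f\|_n^2\right\} \\
&= (1+a)\left\{\|f_\lambda - f\|_n^2 + 2\langle W, \tilde f - f_\lambda\rangle_n + \operatorname{pen}(\lambda) - \frac{\operatorname{pen}(\hat\lambda)}{2}\right\} \\
&\quad + \left\{\|\tilde f - f\|^2 - (1+a)\|\tilde f - f\|_n^2 - \frac{1+a}{2}\operatorname{pen}(\hat\lambda)\right\}.
\end{align*}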



The first term on the right, provided $K_1 > 0$ is chosen large enough, can be handled in exactly the same way as in the proof of Theorem 3.1. It remains to study the second term on the right.



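Considering separately the cases $M(\hat\lambda) = 0$ and $1 \le M(\hat\lambda) \le M$ we obtain

\[
\|\tilde f - f\|^2 - (1+a)\|\tilde f - f\|_n^2 - \frac{1+a}{2}\operatorname{pen}(\hat\lambda) \le \max\left\{U_0,\ \max_{1\le m\le M}\ \sup_{\lambda:\, M(\lambda)=m}\left(U_\lambda - \frac{1+a}{2}\operatorname{pen}(\lambda)\right)\right\},
\]

where $U_\lambda = \|f_\lambda - f\|^2 - (1+a)\|f_\lambda - f\|_n^2$. For each $1 \le m \le M$, let the sets $\Lambda_{m,k}(0)$, $1 \le k \le \binom{M}{m}$, form a partitioning of the set $\Lambda_m(0) = \{\lambda \in \mathbb{R}^M : M(\lambda) = m\}$. Deduce that, for any $\delta > 0$,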

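\begin{align*}
\mathbb{P}\left\{\|\tilde f - f\|^2 - (1+a)\|\tilde f - f\|_n^2 - \frac{1+a}{2}\operatorname{pen}(\hat\lambda) \ge \delta\right\}
&\le \mathbb{P}\{U_0 \ge \delta/2\} + \sum_{m=1}^{M}\mathbb{P}\left\{\sup_{\lambda:\, M(\lambda)=m} U_\lambda \ge D(\delta)\right\} \\
&\le \mathbb{P}\{U_0 \ge \delta/2\} + \sum_{m=1}^{M}\sum_{k=1}^{\binom{M}{m}}\mathbb{P}\left\{\sup_{\lambda\in\Lambda_{m,k}(0)} U_\lambda \ge D(\delta)\right\}, \tag{3.11}
\end{align*}

where

\[
D(\delta) = \frac{(1+a)K_1}{2n}\, m\log\left(1+\frac{n\vee M}{m\vee 1}\right) + \frac{\delta}{2}.
\]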
The following result establishes a bound on the shatter coefficient of the class of subgraphs of the functions $(f_\lambda - f)^2$ that will be subsequently used to control the behavior of the empirical process on the right-hand side of (3.11).

Lemma 3.12. Let $S(n,m,k)$ be the shatter coefficient of the collection of sets

\[
\{(x,\beta) : (f_\lambda - f)^2(x) \ge \beta,\ \beta \ge 0,\ x \in \mathcal{X}\}, \qquad \lambda \in \Lambda_{m,k}(0).
\]

Then, for any $1 \le m \le M$, $1 \le k \le \binom{M}{m}$, we have

\[
\log S(2n,m,k) \le Cm\left\{1 + \log\left(1 + \frac{n}{m}\right)\right\},
\]

where $C > 0$ is an absolute constant.

Proof. Note that

\[
\{(x,\beta) : (f_\lambda - f)^2(x) \ge \beta,\ \beta \ge 0\} = \{(x,\beta) : f_\lambda(x) - f(x) \le -\sqrt{\beta},\ \beta \ge 0\} \cup \{(x,\beta) : f_\lambda(x) - f(x) \ge \sqrt{\beta},\ \beta \ge 0\}
\]

and recall that the VC-dimension of the collection of sets $\{(x,\beta) : f_\lambda(x) - f(x) \ge \sqrt{\beta},\ \beta \ge 0\}$, $\lambda \in \Lambda_{m,k}(0)$, is less than $m+1$, cf. Theorem 13.9 of Devroye, Györfi and Lugosi (1996) or van de Geer (2000), page 40. Similarly, the VC-dimension of $\{(x,\beta) : f_\lambda(x) - f(x) \le -\sqrt{\beta},\ \beta \ge 0\}$, $\lambda \in \Lambda_{m,k}(0)$, is less than $m+1$. Apply Lemma 15, page 18, in Pollard (1984) to deduce that the collection of sets $\{(x,\beta) : (f_\lambda - f)^2(x) \ge \beta,\ \beta \ge 0\}$, $\lambda \in \Lambda_{m,k}(0)$, has VC-dimension $V_k$ less than $m+1$. The shatter coefficient $S(2n,m,k)$ is related to the VC-dimension of the latter class by the inequality

\[
\log S(2n,m,k) \le V_k\left\{1 + \log\left(1 + \frac{2n}{V_k}\right)\right\},
\]

see, for example, Theorem 4.3 on page 145 of Vapnik (1998). To conclude the proof, use the fact that the right-hand side is an increasing function of $V_k$. □
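
The last step of the proof uses that $V \mapsto V\{1 + \log(1 + 2n/V)\}$ is increasing. A minimal numerical check of this monotonicity on an integer grid is sketched below; the function names are ours.

```python
import math

def log_shatter_bound(V, n):
    """V * (1 + log(1 + 2n / V)): the bound on log S(2n, m, k) for VC-dimension V."""
    return V * (1.0 + math.log(1.0 + 2.0 * n / V))

def is_increasing_in_V(n, V_max):
    """Check strict monotonicity in V over V = 1, ..., V_max."""
    values = [log_shatter_bound(V, n) for V in range(1, V_max + 1)]
    return all(u < v for u, v in zip(values, values[1:]))

print(is_increasing_in_V(100, 50))
```

Analytically, the derivative in $V$ equals $1 + \log(1+u) - u/(1+u)$ with $u = 2n/V$, which is positive for every $u > 0$.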



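Now, using the inequality $D(\delta) + a\|f_\lambda - f\|^2 \ge 2\sqrt{aD(\delta)}\,\|f_\lambda - f\|$ and Theorem 5.3* on page 198 of Vapnik (1998) we get

\begin{align*}
\mathbb{P}\left\{\sup_{\lambda\in\Lambda_{m,k}(0)} U_\lambda \ge D(\delta)\right\}
&= \mathbb{P}\left\{\exists\,\lambda \in \Lambda_{m,k}(0) : \|f_\lambda - f\| \ne 0 \text{ and } (1+a)\left[\|f_\lambda - f\|^2 - \|f_\lambda - f\|_n^2\right] \ge D(\delta) + a\|f_\lambda - f\|^2\right\} \\
&\le \mathbb{P}\left\{\sup_{\lambda\in\Lambda_{m,k}(0):\, \|f_\lambda - f\|\ne 0} \frac{\|f_\lambda - f\|^2 - \|f_\lambda - f\|_n^2}{\|f_\lambda - f\|} \ge \frac{2\sqrt{aD(\delta)}}{1+a}\right\} \\
&\le 4\,S(2n,m,k)\exp\left(-\frac{an\,D(\delta)}{(1+a)^2B}\right).
\end{align*}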
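Therefore

\begin{align*}
\sum_{m=1}^{M}\sum_{k=1}^{\binom{M}{m}}\mathbb{P}\left\{\sup_{\lambda\in\Lambda_{m,k}(0)} U_\lambda \ge D(\delta)\right\}
&\le 4\sum_{m=1}^{M}\sum_{k=1}^{\binom{M}{m}} S(2n,m,k)\exp\left(-\frac{an\,D(\delta)}{(1+a)^2B}\right) \\
&\le 4\sum_{m=1}^{M}\binom{M}{m}\exp\left\{Cm\left[1+\log\left(1+\frac{n}{m}\right)\right] - \frac{aK_1 m}{2(1+a)B}\log\left(1+\frac{n\vee M}{m\vee 1}\right)\right\}\exp\left(-\frac{an\delta}{2(1+a)^2B}\right) \quad \text{by Lemma 3.12} \\
&\le C_5\exp\left(-C_6\frac{n\delta}{aB}\right), \qquad \forall\, a > 1,
\end{align*}

for $K_1 = K_1(a,B)$ large enough, and some universal constants $C_5, C_6 > 0$, where we have used the same crude bound for $\binom{M}{m}$ as in the proof of Lemma 3.11. Furthermore,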



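\[
\mathbb{P}\{U_0 \ge \delta/2\} \le \mathbb{P}\left\{\frac{\|f\|^2 - \|f\|_n^2}{\|f\|} \ge \frac{\sqrt{2a\delta}}{1+a}\right\} \le \exp\left(-\frac{an\delta}{(1+a)^2B}\right) \le \exp\left(-\frac{n\delta}{4aB}\right), \qquad \forall\, a > 1,
\]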

where the last but one inequality follows, e.g., from Proposition 2.6 in Wegkamp (2003). The 
exponential bounds in the last two displays and (3.11) easily imply 

i±^pen(A))<C 7 ^ 
2 n 



E/ ||/-/r-(i + o)||/-/||; 



for some constant $C_7 > 0$. This concludes the proof of Theorem 3.6. □



4. Near optimal aggregation with a data dependent $L_1$ penalty

We consider here only the fixed design regression. In addition to Assumptions (Al) and (A2), 
throughout this section we suppose the following. 

Assumption (A3) The matrix

\[
\Psi_n = \left(\frac{1}{n}\sum_{i=1}^{n} f_j(X_i)f_{j'}(X_i)\right)_{1\le j,j'\le M}
\]

is positive definite for any given $n \ge 1$.

Let $\xi_{\min}$ be the smallest eigenvalue of the matrix $\Psi_n$. Note that under our assumptions

\[
0 < \xi_{\min} \le \|f_j\|_n^2 \le L^2, \qquad j = 1,\dots,M. \tag{4.1}
\]

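We propose the aggregation procedure defined by the following choice of weights:

\[
\hat\lambda = \arg\min_{\lambda\in\Lambda_{M,T,2}} \{S(\lambda) + \operatorname{pen}(\lambda)\} \tag{4.2}
\]

where

\[
\Lambda_{M,T,2} = \left\{\lambda \in \mathbb{R}^M : \sum_{j=1}^{M}\lambda_j^2 \le T^2\right\}
\]

for $T > 0$ large enough, and the penalty term is given by

\[
\operatorname{pen}(\lambda) = \sum_{j=1}^{M} r_{n,j}|\lambda_j| \quad \text{with} \quad r_{n,j} = 2\sqrt{2}\,\sigma\|f_j\|_n\sqrt{\frac{2\log M + \log n}{n}}. \tag{4.3}
\]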
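
To make the data dependent weights concrete, here is a small sketch that computes $r_{n,j}$ and the penalty from the values of the $f_j$ at the design points. It assumes the weights have the form $r_{n,j} = 2\sqrt{2}\,\sigma\|f_j\|_n\sqrt{(2\log M + \log n)/n}$ as in (4.3), and that `F[j]` stores $(f_j(X_1),\dots,f_j(X_n))$; the function names and the toy dictionary are ours.

```python
import math

def empirical_norm(values):
    """||g||_n = sqrt((1/n) * sum_i g(X_i)^2)."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def penalty_weights(F, sigma):
    """r_{n,j} for each f_j, with F[j] the values of f_j at the n design points."""
    M, n = len(F), len(F[0])
    scale = 2.0 * math.sqrt(2.0) * sigma * math.sqrt((2.0 * math.log(M) + math.log(n)) / n)
    return [scale * empirical_norm(fj) for fj in F]

def l1_penalty(lam, weights):
    """pen(lambda) = sum_j r_{n,j} * |lambda_j|."""
    return sum(r * abs(l) for r, l in zip(weights, lam))

# Toy dictionary: M = 2 functions observed at n = 4 design points.
F = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]]
weights = penalty_weights(F, sigma=1.0)
print(weights, l1_penalty([1.0, -0.5], weights))
```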



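Theorem 4.1. Let $X_i \in \mathcal{X}$, $i = 1,\dots,n$, be fixed. Let $\hat\lambda$ be the penalized least squares estimate defined by (4.2) with penalty (4.3). Set $\tilde f = f_{\hat\lambda}$. Let $T > 0$ be such that $T^2\xi_{\min} \ge 2L^2$. Then, for all $a > 1$, and all integers $n \ge 1$, $M \ge 2$, we have

\[
\mathbb{E}_f\|\tilde f - f\|_n^2 \le \inf_{\lambda\in\mathbb{R}^M}\left\{\frac{a+1}{a-1}\|f_\lambda - f\|_n^2 + \frac{16a^2}{a-1}\,\frac{L^2\sigma^2}{\xi_{\min}}\,\frac{2\log M + \log n}{n}\,M(\lambda)\right\} + \frac{(T + M^{-1/2})^2L^2}{n\sqrt{\pi(2\log M + \log n)}}. \tag{4.4}
\]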



Corollary 4.2 (MS). Let assumptions of Theorem 4.1 be satisfied and $T \le (\log(M \vee n))^{1/4}$. Then there exists a constant $C = C(T, L, \sigma^2, \xi_{\min}) > 0$ such that for all $\varepsilon > 0$ and for all integers $n \ge 1$ and $M \ge 2$,

\[
\mathbb{E}_f\|\tilde f - f\|_n^2 \le (1+\varepsilon)\inf_{1\le j\le M}\|f_j - f\|_n^2 + C\,(1 + \varepsilon + \varepsilon^{-1})\,\frac{\log(M\vee n)}{n}.
\]

Proof. Using assumptions on $T$ and (4.1), we trivially get $T \ge \sqrt{2}L/\sqrt{\xi_{\min}} \ge M^{-1/2}$. This implies that the last summand in (4.4) is $O(1/n)$. The rest of the proof is analogous to that of Corollary 3.2. □

Corollary 4.3 (C). Let assumptions of Theorem 4.1 be satisfied and $T \le (\log(M \vee n))^{1/4}$. Then there exists a constant $C = C(T, L, \sigma^2, \xi_{\min}) > 0$ such that for all $\varepsilon > 0$ and for all integers $n \ge 1$ and $M \ge 2$,

\[
\mathbb{E}_f\|\tilde f - f\|_n^2 \le (1+\varepsilon)\inf_{\lambda\in\Lambda^M}\|f_\lambda - f\|_n^2 + C\,(1 + \varepsilon + \varepsilon^{-1})\,\psi_n^C(M),
\]

where

Wn{ ' ]y (log M)/n ifM>^E. 

Proof. We bound the last summand in (4.4) as in the previous proof and then use an argument similar to that of the proof of Corollary 3.4. □



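Corollary 4.4 (L). Let assumptions of Theorem 4.1 be satisfied and $T \le (\log(M \vee n))^{1/4}$. Then there exists a constant $C = C(T, L, \sigma^2, \xi_{\min}) > 0$ such that for all $\varepsilon > 0$ and for all integers $n \ge 1$ and $M \ge 2$,

\[
\mathbb{E}_f\|\tilde f - f\|_n^2 \le (1+\varepsilon)\inf_{\lambda\in\mathbb{R}^M}\|f_\lambda - f\|_n^2 + C\,(1 + \varepsilon + \varepsilon^{-1})\,\frac{M\log(M\vee n)}{n}.
\]

Proof. We bound the last summand in (4.4) as in the proof of Corollary 4.2 and we use that $M(\lambda) \le M$. □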

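Proof of Theorem 4.1. We begin as in Loubes and Van de Geer (2002). By definition, $\tilde f = f_{\hat\lambda}$ satisfies

\[
S(\hat\lambda) + \sum_{j=1}^{M} r_{n,j}|\hat\lambda_j| \le S(\lambda) + \sum_{j=1}^{M} r_{n,j}|\lambda_j|
\]

for all $\lambda \in \Lambda_{M,T,2}$, which we may rewrite as

\[
\|\tilde f - f\|_n^2 + \sum_{j=1}^{M} r_{n,j}|\hat\lambda_j| \le \|f_\lambda - f\|_n^2 + \sum_{j=1}^{M} r_{n,j}|\lambda_j| + 2\langle W, \tilde f - f_\lambda\rangle_n.
\]

We define the random variables

\[
V_j = \frac{1}{n}\sum_{i=1}^{n} f_j(X_i)W_i, \qquad 1 \le j \le M,
\]

and the event

\[
A = \bigcap_{j=1}^{M}\{2|V_j| \le r_{n,j}\}.
\]

The normality assumption (A1) on $W_i$ implies that $\sqrt{n}\,V_j \sim N(0, \sigma^2\|f_j\|_n^2)$, $1 \le j \le M$. Applying the union bound followed by the standard tail bound for the $N(0,1)$ distribution yields

\[
\mathbb{P}(A^c) \le \sum_{j=1}^{M}\mathbb{P}\left\{\sqrt{n}\,|V_j| > \sqrt{n}\,r_{n,j}/2\right\} \le \sum_{j=1}^{M}\frac{4\sigma\|f_j\|_n}{\sqrt{2\pi}\,\sqrt{n}\,r_{n,j}}\exp\left(-\frac{n\,r_{n,j}^2}{8\sigma^2\|f_j\|_n^2}\right) = \frac{1}{Mn\sqrt{\pi(2\log M + \log n)}}. \tag{4.5}
\]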
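
The arithmetic behind (4.5) can be verified directly. With $r_{n,j}$ as in (4.3), each term of the union bound is a two-sided Gaussian tail at the threshold $t = \sqrt{2(2\log M + \log n)}$, and the Mills-ratio bound $\mathbb{P}(|N(0,1)| > t) \le \frac{2}{\sqrt{2\pi}\,t}e^{-t^2/2}$ gives the stated closed form. The sketch below is only a numerical check under these assumptions; the function names are ours.

```python
import math

def gaussian_two_sided_tail(t):
    """Exact P(|N(0,1)| > t)."""
    return math.erfc(t / math.sqrt(2.0))

def mills_bound(t):
    """Standard upper bound 2 / (sqrt(2*pi) * t) * exp(-t^2 / 2) on the two-sided tail."""
    return 2.0 / (math.sqrt(2.0 * math.pi) * t) * math.exp(-t * t / 2.0)

def threshold(M, n):
    """t = sqrt(n) * r_{n,j} / (2 * sigma * ||f_j||_n), which does not depend on j."""
    return math.sqrt(2.0 * (2.0 * math.log(M) + math.log(n)))

def rhs_4_5(M, n):
    """Closed form on the right-hand side of (4.5)."""
    return 1.0 / (M * n * math.sqrt(math.pi * (2.0 * math.log(M) + math.log(n))))

M, n = 10, 200
t = threshold(M, n)
print(M * gaussian_two_sided_tail(t), M * mills_bound(t), rhs_4_5(M, n))
```

The second and third printed numbers should coincide up to rounding, and the first should not exceed them.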
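Then, on the set $A$, we find

\[
2\langle W, \tilde f - f_\lambda\rangle_n = 2\sum_{j=1}^{M} V_j(\hat\lambda_j - \lambda_j) \le \sum_{j=1}^{M} r_{n,j}|\hat\lambda_j - \lambda_j|
\]

and therefore, still on the set $A$,

\[
\|\tilde f - f\|_n^2 \le \|f_\lambda - f\|_n^2 + \sum_{j=1}^{M} r_{n,j}|\hat\lambda_j - \lambda_j| + \sum_{j=1}^{M} r_{n,j}|\lambda_j| - \sum_{j=1}^{M} r_{n,j}|\hat\lambda_j|.
\]

Recall that $J(\lambda)$ denotes the set of indices of the non-zero elements of $\lambda$, and $M(\lambda) = \operatorname{Card} J(\lambda)$. Rewriting the right-hand side of the previous display, we find, on the set $A$,

\begin{align*}
\|\tilde f - f\|_n^2 &\le \|f_\lambda - f\|_n^2 + \left(\sum_{j=1}^{M} r_{n,j}|\hat\lambda_j - \lambda_j| - \sum_{j\notin J(\lambda)} r_{n,j}|\hat\lambda_j|\right) + \left(-\sum_{j\in J(\lambda)} r_{n,j}|\hat\lambda_j| + \sum_{j\in J(\lambda)} r_{n,j}|\lambda_j|\right) \\
&\le \|f_\lambda - f\|_n^2 + 2\sum_{j\in J(\lambda)} r_{n,j}|\hat\lambda_j - \lambda_j|
\end{align*}

by the triangle inequality and the fact that $\lambda_j = 0$ for $j \notin J(\lambda)$. Since $\xi_{\min} > 0$, we have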



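\[
\sum_{j\in J(\lambda)}|\hat\lambda_j - \lambda_j|^2 \le \sum_{j=1}^{M}|\hat\lambda_j - \lambda_j|^2 \le \frac{1}{\xi_{\min}}\|\tilde f - f_\lambda\|_n^2.
\]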



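Combining this with the Cauchy-Schwarz and triangle inequalities, respectively, we find further that, on the set $A$,

\begin{align*}
\|\tilde f - f\|_n^2 &\le \|f_\lambda - f\|_n^2 + 2\sum_{j\in J(\lambda)} r_{n,j}|\hat\lambda_j - \lambda_j| \tag{4.6} \\
&\le \|f_\lambda - f\|_n^2 + 2\sqrt{\frac{1}{\xi_{\min}}\sum_{j\in J(\lambda)} r_{n,j}^2}\ \left(\|\tilde f - f\|_n + \|f_\lambda - f\|_n\right) \\
&\le \|f_\lambda - f\|_n^2 + 2r_n\sqrt{\frac{M(\lambda)}{\xi_{\min}}}\ \left(\|\tilde f - f\|_n + \|f_\lambda - f\|_n\right),
\end{align*}

where

\[
r_n = 2\sqrt{2}\,L\sigma\sqrt{\frac{2\log M + \log n}{n}}.
\]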

Inequality (4.6) is of the simple form $v^2 \le c^2 + vb + cb$ with $v = \|\tilde f - f\|_n$, $b = 2r_n\sqrt{M(\lambda)/\xi_{\min}}$ and $c = \|f_\lambda - f\|_n$. After applying the inequality $2xy \le x^2/\alpha + \alpha y^2$ ($x, y \in \mathbb{R}$, $\alpha > 0$) twice, to $2bc$ and $2bv$, respectively, we easily find $v^2 \le v^2/(2\alpha) + \alpha b^2 + (2\alpha+1)/(2\alpha)\,c^2$, whence $v^2 \le a/(a-1)\{b^2(a/2) + c^2(a+1)/a\}$ for $a = 2\alpha > 1$. Recalling that (4.6) is valid on the set $A$, we now get that

\[
\mathbb{E}_f\left[\|\tilde f - f\|_n^2\, I_A\right] \le \inf_{\lambda\in\Lambda_{M,T,2}}\left\{\frac{a+1}{a-1}\|f_\lambda - f\|_n^2 + \frac{2a^2}{\xi_{\min}(a-1)}\,r_n^2\,M(\lambda)\right\}, \qquad \forall\, a > 1.
\]
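
The algebraic step from $v^2 \le c^2 + vb + cb$ to $v^2 \le \frac{a+1}{a-1}c^2 + \frac{a^2}{2(a-1)}b^2$ can be sanity-checked numerically. Since $v^2 - bv - (c^2 + cb) = (v - b - c)(v + c)$, the largest admissible $v$ is $b + c$, so it suffices to test the final bound at $v = b + c$. The sketch below does this on random inputs; the names and the sampling ranges are ours.

```python
import random

def largest_admissible_v(b, c):
    """Largest v >= 0 with v^2 <= c^2 + v*b + c*b (positive root of the quadratic)."""
    return (b + (b * b + 4.0 * (c * c + c * b)) ** 0.5) / 2.0

def oracle_rhs(a, b, c):
    """(a+1)/(a-1) * c^2 + a^2 / (2(a-1)) * b^2."""
    return (a + 1.0) / (a - 1.0) * c * c + a * a / (2.0 * (a - 1.0)) * b * b

rng = random.Random(0)
violations = 0
for _ in range(10000):
    a = rng.uniform(1.01, 20.0)
    b = rng.uniform(0.0, 5.0)
    c = rng.uniform(0.0, 5.0)
    v = largest_admissible_v(b, c)
    if v * v > oracle_rhs(a, b, c) + 1e-9:
        violations += 1
print("violations:", violations)
```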



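Consequently, since by the Cauchy-Schwarz inequality,

\[
\|\tilde f - f\|_\infty \le L\left(\sum_{j=1}^{M}|\hat\lambda_j| + 1\right) \le (\sqrt{M}\,T + 1)L,
\]

we find

\begin{align*}
\mathbb{E}_f\|\tilde f - f\|_n^2 &\le \mathbb{E}_f\left[\|\tilde f - f\|_n^2\, I_A\right] + (\sqrt{M}\,T + 1)^2L^2\,\mathbb{P}(A^c) \\
&\le \inf_{\lambda\in\Lambda_{M,T,2}}\left\{\frac{a+1}{a-1}\|f_\lambda - f\|_n^2 + \frac{2a^2 r_n^2}{(a-1)\xi_{\min}}\,M(\lambda)\right\} + \frac{(T + M^{-1/2})^2L^2}{n\sqrt{\pi(2\log M + \log n)}}. \tag{4.7}
\end{align*}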

It remains to show that (4.7) remains valid with the set $\Lambda_{M,T,2}$ replaced by the entire $\mathbb{R}^M$. For this, observe that $\lambda \notin \Lambda_{M,T,2}$ implies $\sum_{j=1}^{M}\lambda_j^2 > T^2$, and thus $\|f_\lambda\|_n^2 \ge \xi_{\min}\sum_{j=1}^{M}\lambda_j^2 > \xi_{\min}T^2$. Therefore, for $\lambda \notin \Lambda_{M,T,2}$, we have

\[
\|f_\lambda - f\|_n \ge \|f_\lambda\|_n - \|f\|_n > \sqrt{\xi_{\min}}\,T - L \ge L
\]

by our choice of $T$. On the other hand, for $\lambda = 0 \in \Lambda_{M,T,2}$, we have

\[
\|f_\lambda - f\|_n = \|f\|_n \le L
\]

and $\operatorname{pen}(0) = 0$. Thus, the value of the whole expression under the infimum in (4.7) for $\lambda = 0$ is strictly smaller than the value of this expression for any $\lambda \notin \Lambda_{M,T,2}$, which proves the result. □

As in Section 3.1, we now present a statement in probability that complements the results of this section.

Theorem 4.5. Let $X_i \in \mathcal{X}$, $i = 1,\dots,n$, be fixed. Let $\hat\lambda$ be the penalized least squares estimate defined by (4.2) with $\Lambda_{M,T,2}$ replaced by $\mathbb{R}^M$ and with penalty (4.3). Set $\tilde f = f_{\hat\lambda}$. Then, for all $a > 1$, and all integers $n \ge 1$, $M \ge 1$, we have

\[
\mathbb{P}\left(\|\tilde f - f\|_n^2 > \inf_{\lambda\in\mathbb{R}^M}\left[\frac{a+1}{a-1}\|f_\lambda - f\|_n^2 + \frac{16a^2}{a-1}\,\frac{L^2\sigma^2}{\xi_{\min}}\,\frac{2\log M + \log n}{n}\,M(\lambda)\right]\right) \le \frac{1}{Mn\sqrt{\pi(2\log M + \log n)}}. \tag{4.8}
\]

Proof. This result follows directly from the proof of Theorem 4.1. Note first that now (4.6) is valid for all $\lambda \in \mathbb{R}^M$ and not only for $\lambda \in \Lambda_{M,T,2}$. Using (4.6) and the argument after it we find that the left hand side in (4.8) can be bounded by $\mathbb{P}(A^c)$. The result follows by invoking (4.5). □



Remarks. 

1. The method presented in this section is not strictly an $L_1$-penalized one. Indeed, it implements two penalties: the data dependent $L_1$-penalty $\sum_{j=1}^{M} r_{n,j}|\lambda_j|$, and the $L_2$-penalty $\sum_{j=1}^{M}\lambda_j^2$ that appears implicitly via the choice of the set $\Lambda_{M,T,2}$. The resulting minimization problem can be solved in practice using standard convex programming software. The $L_2$ part of the penalty is less influential, since it should typically be applied with $T \to \infty$ as $M$ (respectively $n$) grows, which means that the restriction to $\Lambda_{M,T,2}$ becomes asymptotically negligible. Moreover, the restriction is not always needed. For example, the bound in probability (Theorem 4.5) is obtained for $\hat\lambda$ that minimizes the $L_1$-penalized least squares over the entire $\mathbb{R}^M$.

2. Assumption (A3) is mild, and it is also made by Efron et al. (2004) in the context of LARS. In practice, this assumption can always be checked. A stronger assumption is that $\xi_{\min} > c$ for some constant $c > 0$, independent of $n$ and $M$ if one or both of these parameters are allowed to grow (which is typically the more interesting case). There are at least two important examples where such a stronger assumption holds. The first example is standard in the parametric regression context: $M$ is fixed and $\Psi_n/n \to \Psi$ where $\Psi$ is a nonsingular $M \times M$ matrix. The second one is related to nonparametric regression: $M = M_n$ is allowed to go to $\infty$ as $n \to \infty$ and the functions $f_j$ are orthogonal with respect to the empirical norm. This corresponds, for instance, to sequence space models, where the estimators $f_j = \hat f_j$ are constructed from non-intersecting blocks of coefficients. Aggregating such mutually orthogonal estimators may lead to adaptive estimators with good asymptotic properties [cf., e.g., Nemirovski (2000)]. Local image smoothing provides an application where the condition $\xi_{\min} > c$ is naturally satisfied. For example, Katkovnik et al. (2002, 2004) suggest different methods of aggregation of local image estimators obtained from non-intersecting sectors around a given pixel (these estimators are mutually orthogonal with respect to the empirical norm).
3. Inspection of the proofs shows that the constants $C = C(T, L, \sigma^2, \xi_{\min})$ in Corollaries 4.2, 4.3, 4.4 have the form $C = A_1 + A_2\,\xi_{\min}^{-1}$, where $A_1$ and $A_2$ are constants independent of $\xi_{\min}$. In general, $\xi_{\min}$ may depend on $n$ and $M$. However, if $\xi_{\min} > c$ for some constant $c > 0$, independent of $n$ and $M$, as previously discussed, the rates of aggregation given in Corollaries 4.2, 4.3, 4.4 are near optimal, up to logarithmic factors. They are even exactly optimal (cf. (1.3) and the lower bounds of the next section) for some configurations of $n$, $M$: for (MS)-aggregation if $n^{a'} \le M \le n^{a}$, and for (C)-aggregation if $n^{1/2} \le M \le n^{a}$, where $0 < a' < a < \infty$.

4. From the bound in Theorem 4.1 we see that $T$ is allowed to grow with $n$ and $M$ (as fast as $T \asymp (\log(M \vee n))^{1/4}$ is possible). Moreover, the proof of Theorem 4.1 reveals that by taking a larger constant than $2\sqrt{2}$ in (4.3), even faster rates are allowed; for example, $T$ can grow as a power of $n$. This may be needed to guarantee the condition $T^2 \ge 2L^2/\xi_{\min}$ for $n$ large enough, because the value $L$ is typically not known and $\xi_{\min}$ may depend on $n$ and $M$. However, the condition $T^2 \ge 2L^2/\xi_{\min}$ is only needed to cover the linear aggregation. For (MS) and (C) aggregation, Corollaries 4.2, 4.3 can be obtained directly from (4.7), and thus it suffices to take any $T \ge 1$, since $\Lambda^M \subset \Lambda_{M,1,2}$, or to replace $\Lambda_{M,T,2}$ by $\Lambda^M$ in the definition of $\hat\lambda$.
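
Remark 1 notes that the minimization can be carried out with standard convex programming software. As a minimal illustration only, the sketch below solves the unconstrained version (the setting of Theorem 4.5, without the restriction to $\Lambda_{M,T,2}$) by cyclic coordinate descent with soft thresholding. It assumes that $S(\lambda)$ is the empirical squared loss $\frac{1}{n}\sum_{i}(Y_i - f_\lambda(X_i))^2$; this choice of solver, the function names and the toy data are ours and are not taken from the paper.

```python
def soft_threshold(z, t):
    """sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def weighted_lasso_cd(F, Y, weights, sweeps=200):
    """Minimize (1/n) * sum_i (Y_i - sum_j lam_j * F[j][i])^2 + sum_j weights[j] * |lam_j|."""
    M, n = len(F), len(Y)
    lam = [0.0] * M
    residual = list(Y)                       # Y - f_lambda at the design points
    sq_norms = [sum(v * v for v in fj) / n for fj in F]
    for _ in range(sweeps):
        for j in range(M):
            if sq_norms[j] == 0.0:
                continue
            # Empirical inner product of f_j with the residual that excludes f_j itself.
            rho = sum(F[j][i] * residual[i] for i in range(n)) / n + sq_norms[j] * lam[j]
            new = soft_threshold(rho, weights[j] / 2.0) / sq_norms[j]
            delta = new - lam[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * F[j][i]
                lam[j] = new
    return lam

# Toy example: two functions that are orthonormal in the empirical norm.
F = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]]
Y = [2.1, 1.9, 2.1, 1.9]
lam_hat = weighted_lasso_cd(F, Y, weights=[0.5, 0.5])
print(lam_hat)
```

In this orthonormal toy case each coordinate is obtained by soft thresholding the empirical inner product of $Y$ with $f_j$ at level $r_j/2$, so the weakly correlated second function is set exactly to zero.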

5. Lower bounds 

For regression with random design and the $L_2(\mathbb{R}^d, \mu)$-risks, lower bounds for aggregation and optimal rates $\psi_{n,M}$ as given in (1.3) were established by Tsybakov (2003). In this section we extend the lower bounds of Tsybakov (2003) for (MS) and (L) aggregation to regression with fixed design. Further, we state these bounds in a more general form, considering not only the expected squared risks, but also other loss functions. This generalization allows one to treat optimality of the upper bounds "in probability" obtained in the previous sections (Theorems 3.5, 4.5). It shows that the remainder terms in these bounds are optimal or near optimal for the (MS) and (L) aggregation.

In this section we suppose that $X_1,\dots,X_n$ are fixed and that $M \le n$. Let $w : \mathbb{R} \to [0,\infty)$ be a loss function, i.e., a monotone non-decreasing function satisfying $w(0) = 0$ and $w \not\equiv 0$.

Theorem 5.1. Let $X_i \in \mathcal{X}$, $i = 1,\dots,n$, be fixed and $2 \le M \le n$. Assume that $H^M$ is either the whole $\mathbb{R}^M$ (the (L) aggregation case) or the set of vertices of $\Lambda^M$ (the (MS) aggregation case). Let the corresponding $\psi_{n,M}$ be given by (1.3) and let $M\log M \le n$ for the case of (MS) aggregation. Then there exist $f_1,\dots,f_M \in \mathcal{F}_0$ such that, for any loss function $w(\cdot)$,

\[
\inf_{T_n}\ \sup_{f\in\mathcal{F}_0} \mathbb{E}_f\, w\left(\psi_{n,M}^{-1}\left(\|T_n - f\|_n^2 - \inf_{\lambda\in H^M}\|f_\lambda - f\|_n^2\right)\right) \ge c, \tag{5.1}
\]

where $\inf_{T_n}$ denotes the infimum over all estimators and the constant $c > 0$ does not depend on $M$ and $n$.

Setting $w(u) = u$ in Theorem 5.1 we get the lower bounds for expected squared risks showing optimality or near optimality of the remainder terms in the oracle inequalities of Corollaries 3.2, 3.3, 4.2, 4.4. The choice of $w(u) = I\{u > a\}$ with some fixed $a > 0$ leads to the lower bounds for probabilities showing near optimality of the remainder terms in the corresponding upper bounds (see Theorems 3.5, 4.5).

Proof. We proceed similarly to Tsybakov (2003). The proof is based on the following lemma [which can be obtained, for example, by combining Theorems 2.2 and 2.5 in Tsybakov (2004)].

Lemma 5.2. Let $w$ be a loss function, $A > 0$ be such that $w(A) > 0$, and let $\mathcal{C}$ be a finite set of functions on $\mathcal{X}$ such that $N = \operatorname{card}(\mathcal{C}) \ge 2$,

\[
\|f - g\|_n^2 \ge 4s^2 > 0, \qquad \forall\, f, g \in \mathcal{C},\ f \ne g,
\]

and the Kullback divergences $K(\mathbb{P}_f, \mathbb{P}_g)$ between the measures $\mathbb{P}_f$ and $\mathbb{P}_g$ satisfy

\[
K(\mathbb{P}_f, \mathbb{P}_g) \le (1/16)\log N, \qquad \forall\, f, g \in \mathcal{C}.
\]

Then for $\psi = s^2/A$ we have

\[
\inf_{T_n}\ \sup_{f\in\mathcal{C}} \mathbb{E}_f\, w\left(\psi^{-1}\|T_n - f\|_n^2\right) \ge c_1 w(A),
\]

where $\inf_{T_n}$ denotes the infimum over all estimators and $c_1 > 0$ is a constant.

The (MS) aggregation case. Let $H^M$ be the set of vertices of $\Lambda^M$, $M\log M \le n$, and $\psi_{n,M} = (\log M)/n$. Pick $M$ disjoint subsets $S_1,\dots,S_M$ of $\{X_1,\dots,X_n\}$, each $S_j$ of cardinality $\log M$ (w.l.o.g. we assume that $\log M$ is an integer) and define the functions

\[
f_j(x) = \gamma\, I\{x \in S_j\}, \qquad j = 1,\dots,M,
\]

where $\gamma \le L$ is a positive constant to be chosen. Clearly, $\{f_1,\dots,f_M\} \subset \mathcal{F}_0$. Thus, it suffices to prove the lower bound of the theorem where the supremum over $f \in \mathcal{F}_0$ is replaced by that over $f \in \{f_1,\dots,f_M\}$. But for such $f$ we have $\min_{1\le j\le M}\|f_j - f\|_n^2 = 0$, and to finish the proof for the (MS) case, it is sufficient to bound from below the quantity $\sup_{f\in\{f_1,\dots,f_M\}} \mathbb{E}_f\, w(\psi_{n,M}^{-1}\|T_n - f\|_n^2)$, where $\psi_{n,M} = (\log M)/n$, uniformly over all estimators $T_n$. This is done by applying Lemma 5.2. In fact, note that, for $j \ne k$,

\[
\|f_j - f_k\|_n^2 = \frac{2\gamma^2\log M}{n}. \tag{5.2}
\]

Since the $W_j$'s are $N(0,\sigma^2)$ random variables, the Kullback divergence $K(\mathbb{P}_{f_j}, \mathbb{P}_{f_k})$ between $\mathbb{P}_{f_j}$ and $\mathbb{P}_{f_k}$ satisfies

\[
K(\mathbb{P}_{f_j}, \mathbb{P}_{f_k}) = \frac{n}{2\sigma^2}\|f_j - f_k\|_n^2, \qquad j, k = 1,\dots,M. \tag{5.3}
\]

In view of (5.2) and (5.3), one can choose $\gamma$ small enough to have $K(\mathbb{P}_{f_j}, \mathbb{P}_{f_k}) \le (1/16)\log M$ for $j, k = 1,\dots,M$. Now, to get the lower bound for the (MS) case, it remains to use this inequality, identity (5.2) and Lemma 5.2.
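
The construction in the (MS) case is easy to reproduce numerically. In the sketch below the design points are identified with the indices $0,\dots,n-1$, the sets $S_j$ are consecutive blocks, and the parameter `block` plays the role of $\log M$ (assumed to be an integer in the proof); these encoding choices and all names are ours.

```python
def ms_construction(n, M, block, gamma):
    """Values of f_j = gamma * 1{x in S_j} at the n design points, S_j disjoint blocks."""
    if M * block > n:
        raise ValueError("need M * block <= n to fit M disjoint blocks")
    F = []
    for j in range(M):
        fj = [0.0] * n
        for i in range(j * block, (j + 1) * block):
            fj[i] = gamma
        F.append(fj)
    return F

def sq_dist_n(g, h):
    """Squared empirical distance ||g - h||_n^2."""
    return sum((u - v) ** 2 for u, v in zip(g, h)) / len(g)

def kullback_gaussian(g, h, sigma):
    """K(P_g, P_h) = n / (2 sigma^2) * ||g - h||_n^2 for N(0, sigma^2) errors, as in (5.3)."""
    return len(g) * sq_dist_n(g, h) / (2.0 * sigma ** 2)

n, M, block, gamma, sigma = 20, 4, 2, 0.25, 1.0
F = ms_construction(n, M, block, gamma)
print(sq_dist_n(F[0], F[1]), 2.0 * gamma ** 2 * block / n)     # both sides of (5.2)
print(kullback_gaussian(F[0], F[1], sigma), gamma ** 2 * block / sigma ** 2)
```

With `block` equal to $\log M$, the divergence equals $\gamma^2\log M/\sigma^2$, so the condition of Lemma 5.2 holds as soon as $\gamma^2 \le \sigma^2/16$.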

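The (L) aggregation case. Let $H^M = \mathbb{R}^M$ and $\psi_{n,M} = M/n$. Define the functions $f_j(x) = \gamma\, I\{x = X_j\}$, $j = 1,\dots,M$, with $0 < \gamma \le L$ and introduce a finite set of their linear combinations

\[
\mathcal{U} = \left\{ g = \sum_{j=1}^{M}\omega_j f_j : \omega \in \Omega \right\}, \tag{5.4}
\]

where $\Omega$ is the set of all vectors $\omega \in \mathbb{R}^M$ with binary coordinates $\omega_j \in \{0,1\}$. Since the supports of the $f_j$'s are disjoint, the functions $g \in \mathcal{U}$ are uniformly bounded by $\gamma$, thus $\mathcal{U} \subset \mathcal{F}_0$. Clearly, $\min_{\lambda\in\mathbb{R}^M}\|f_\lambda - f\|_n^2 = 0$ for any $f \in \mathcal{U}$. Therefore, similarly to the (MS) case, it is sufficient to bound from below the quantity $\sup_{f\in\mathcal{U}} \mathbb{E}_f\, w(\psi_{n,M}^{-1}\|T_n - f\|_n^2)$ where $\psi_{n,M} = M/n$, uniformly over all estimators $T_n$.

Note that for any $g_1 = \sum_{j=1}^{M}\omega_j f_j \in \mathcal{U}$ and $g_2 = \sum_{j=1}^{M}\omega'_j f_j \in \mathcal{U}$ we have

\[
\|g_1 - g_2\|_n^2 = \frac{\gamma^2}{n}\sum_{j=1}^{M}(\omega_j - \omega'_j)^2 \le \gamma^2 M/n. \tag{5.5}
\]

Let first $M \ge 8$. Then it follows from the Varshamov-Gilbert bound (see, for instance, Tsybakov (2004), Chapter 2) that there exists a subset $\mathcal{U}_0$ of $\mathcal{U}$ such that $\operatorname{card}(\mathcal{U}_0) \ge 2^{M/8}$ and

\[
\|g_1 - g_2\|_n^2 \ge C_1\gamma^2 M/n \tag{5.6}
\]

for any distinct $g_1, g_2 \in \mathcal{U}_0$. Using (5.3) and (5.5) we get, for any $g_1, g_2 \in \mathcal{U}_0$,

\[
K(\mathbb{P}_{g_1}, \mathbb{P}_{g_2}) \le C_2\gamma^2 M \le C_3\gamma^2\log(\operatorname{card}(\mathcal{U}_0)),
\]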
and by choosing $\gamma$ small enough, we can finish the proof in the same way as in the (MS) case. If $2 \le M < 8$, we have $\psi_{n,M} \le 8/n$, and the proof is easily obtained by choosing $f_1 \equiv 0$ and $f_2 \equiv \gamma n^{-1/2}$ and applying Lemma 5.2 to the set $\mathcal{U}_0 = \{f_1, f_2\}$. □
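
Identity (5.5) says that the squared empirical distance between two elements of $\mathcal{U}$ is $\gamma^2/n$ times the Hamming distance between their binary codes. The sketch below checks this exhaustively for a small $M$; identifying the design point $X_j$ with index $j$ and the function names are conventions of ours.

```python
from itertools import product

def combination(omega, gamma, n):
    """Values at the n design points of g = sum_j omega_j * f_j, with f_j = gamma * 1{x = X_j}."""
    g = [0.0] * n
    for j, w in enumerate(omega):
        g[j] = gamma * w
    return g

def sq_dist_n(g, h):
    """Squared empirical distance ||g - h||_n^2."""
    return sum((u - v) ** 2 for u, v in zip(g, h)) / len(g)

def hamming(omega1, omega2):
    """Number of coordinates in which two binary vectors differ."""
    return sum(1 for a, b in zip(omega1, omega2) if a != b)

n, M, gamma = 6, 4, 0.5
codes = list(product((0, 1), repeat=M))
max_error = max(
    abs(sq_dist_n(combination(w1, gamma, n), combination(w2, gamma, n))
        - gamma ** 2 * hamming(w1, w2) / n)
    for w1 in codes for w2 in codes
)
print("largest deviation from (5.5):", max_error)
```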



Appendix A. 



Lemma A.1. Let $f, f_1,\dots,f_M \in \mathcal{F}_0$ and $1 \le m \le M$. Let $\mathcal{C}$ be the finite set of functions defined in the proof of Corollary 3.4. Then (3.2) holds and

\[
\min_{g\in\mathcal{C}}\|g - f\|^2 \le \min_{\lambda\in\Lambda^M}\|f_\lambda - f\|^2 + \frac{L^2}{m}. \tag{A.1}
\]

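Proof. Let $f^*$ be the minimizer of $\|f_\lambda - f\|^2$ over $\lambda \in \Lambda^M$. Clearly, $f^*$ is of the form

\[
f^* = \sum_{j=1}^{M} p_j f_j \quad \text{with } p_j \ge 0 \text{ and } \sum_{j=1}^{M} p_j \le 1.
\]

Define a probability distribution on $j = 0, 1, \dots, M$ by

\[
\pi_j = \begin{cases} p_j & \text{if } j \ne 0, \\ 1 - \sum_{j=1}^{M} p_j & \text{if } j = 0. \end{cases}
\]

Consider $m$ i.i.d. random integers $j_1,\dots,j_m$ where each $j_k$ is distributed according to $\{\pi_j\}$ on $\{0, 1, \dots, M\}$. Introduce the random function

\[
\bar f_m = \frac{1}{m}\sum_{k=1}^{m} g_{j_k},
\]

where

\[
g_j = \begin{cases} f_j & \text{if } j \ne 0, \\ 0 & \text{if } j = 0. \end{cases}
\]

For every $x \in \mathcal{X}$ the random variables $g_{j_1}(x),\dots,g_{j_m}(x)$ are i.i.d. with $\mathbb{E}(g_{j_k}(x)) = f^*(x)$. Thus,

\[
\mathbb{E}\left(\bar f_m(x) - f^*(x)\right)^2 = \mathbb{E}\left[\left(\frac{1}{m}\sum_{k=1}^{m}\left\{g_{j_k}(x) - \mathbb{E}\, g_{j_k}(x)\right\}\right)^2\right] \le \frac{\mathbb{E}\, g_{j_1}^2(x)}{m} \le \frac{L^2}{m}.
\]

Hence for every $x \in \mathcal{X}$ and every $f \in \mathcal{F}_0$ we get

\begin{align*}
\mathbb{E}\left(\bar f_m(x) - f(x)\right)^2 &= \mathbb{E}\left(\bar f_m(x) - f^*(x)\right)^2 + \left(f^*(x) - f(x)\right)^2 \\
&\le \frac{L^2}{m} + \left(f^*(x) - f(x)\right)^2. \tag{A.2}
\end{align*}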
m 
Integrating (A.2) over $\mu(dx)$ and recalling the definition of $f^*$ we obtain

\[
\mathbb{E}\|\bar f_m - f\|^2 \le \min_{\lambda\in\Lambda^M}\|f_\lambda - f\|^2 + \frac{L^2}{m}. \tag{A.3}
\]

Finally, note that the random function $\bar f_m$ takes its values in $\mathcal{C}$, which implies that

\[
\mathbb{E}\|\bar f_m - f\|^2 \ge \min_{g\in\mathcal{C}}\|g - f\|^2.
\]

This and (A.3) prove (A.1). The proof of (3.2) is analogous, with the only difference that (A.2) is integrated over the empirical measure rather than over $\mu(dx)$. □
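
The randomization argument of this proof can be illustrated at a single point $x$. The sketch below computes the exact value of $\mathbb{E}(\bar f_m(x) - f^*(x))^2$, which is the variance of one draw divided by $m$, and compares it with a seeded Monte Carlo estimate and with the bound $L^2/m$. The weights `p`, the values `g_values` of $f_1(x),\dots,f_M(x)$, and all function names are illustrative choices of ours.

```python
import random

def exact_pointwise_mse(p, g_values, m):
    """E(bar f_m(x) - f*(x))^2 = Var(g_{j_1}(x)) / m, with g_0 = 0 drawn with probability 1 - sum(p)."""
    mean = sum(pj * gj for pj, gj in zip(p, g_values))            # f*(x)
    second = sum(pj * gj * gj for pj, gj in zip(p, g_values))     # E g_{j_1}(x)^2
    return (second - mean * mean) / m

def monte_carlo_pointwise_mse(p, g_values, m, reps, seed=0):
    """Monte Carlo estimate of the same quantity using i.i.d. draws j_1, ..., j_m."""
    rng = random.Random(seed)
    values = [0.0] + list(g_values)          # index 0 is the null function g_0
    probs = [1.0 - sum(p)] + list(p)
    f_star = sum(pj * gj for pj, gj in zip(p, g_values))
    total = 0.0
    for _ in range(reps):
        draws = rng.choices(values, weights=probs, k=m)
        total += (sum(draws) / m - f_star) ** 2
    return total / reps

p, g_values, m, L = [0.2, 0.3, 0.1], [1.0, -0.5, 0.8], 5, 1.0
exact = exact_pointwise_mse(p, g_values, m)
estimate = monte_carlo_pointwise_mse(p, g_values, m, reps=20000)
print(exact, estimate, L ** 2 / m)
```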